Abstract

The paper considers gossip distributed estimation of a (static) distributed random field (a.k.a. large
scale unknown parameter vector) observed by sparsely interconnected sensors, each of which only
observes a small fraction of the field. We consider linear distributed estimators whose structure combines the
information flow among sensors (the consensus term resulting from the local gossiping exchange among
sensors when they are able to communicate) and the information gathering measured by the sensors (the
sensing or innovations term). This leads to mixed time scale algorithms: one time scale associated with the
consensus and the other with the innovations. The paper establishes a distributed observability condition
(global observability plus mean connectedness) under which the distributed estimates are consistent
and asymptotically normal. We introduce the distributed notion equivalent to the (centralized) Fisher
information rate, which is a bound on the mean square error reduction rate of any distributed estimator;
we show that under the appropriate modeling and structural network communication conditions (gossip
protocol) the distributed gossip estimator attains this distributed Fisher information rate, asymptotically
achieving the performance of the optimal centralized estimator. Finally, we study the behavior of the
distributed gossip estimator when the measurements fade (noise variance grows) with time; in particular,
we consider the maximum rate at which the noise variance can grow while the distributed estimator
remains consistent, and show that, as long as the centralized estimator is consistent, the distributed
estimator remains consistent.

Keywords: Distributed estimation, gossip, random networks, sensor networks, link failures, switching 
topology 
I. Introduction

A. Motivation 

We consider distributed (or decentralized) estimation of a random field where observations are collected 
by possibly a large number of sparsely internetworked sensors. The network operates under the gossip 
random protocol and may be subject to random infrastructure failures (communication channels may fail
intermittently). There is no fusion center and the estimation is performed locally at each sensor, with
inter-sensor message exchanges occurring at random times. Because the random field of interest is distributed,
each sensor can only observe a part of the field, and no sensor can in isolation obtain a reasonable estimate 
of the entire field. This paper studies the conditions under which the distributed algorithms operating under 
the random intermittent conditions (gossip and link failures) that we consider can achieve (asymptotically) 
performance that is equivalent to the estimation performance of centralized optimal algorithms. To be
more concrete and as an abstraction of the environment$^1$, we model it by a static vector parameter, whose
dimension, $M$, can be arbitrarily large. Each sensor's observations, say for sensor $n$, are $M_n$-dimensional
noisy measurements of a part of the (static random) field, where $M_n \ll M$. We assume that the sensing
rate, i.e., the rate of receiving observations at each sensor, is comparable to the communication rate among
sensors, so that each sensor updates its estimate at time index $i$ by appropriately fusing its current estimate
with the observation (innovation) at $i$ and the estimates at $i$ received from those sensors with which it
successfully gossips at $i$. Because of the communication intermittency, the distributed estimators that we
consider exhibit mixed time scales: one associated with the consensus, i.e., mixing estimation updating 
resulting from receiving the estimates from the neighbors; and the other associated with the sensing or 
estimation updating from the innovations. In this paper, we consider a general class of linear distributed 
gossip networked estimators and study the conditions under which they exhibit the same estimation error 
convergence rate as a centralized linear field estimator. Nonlinear distributed estimators and distributed 
estimation of time varying random fields under the gossip protocol are considered elsewhere, [1] and [2], 
respectively. 

We discuss the major challenges in gossip distributed estimation and highlight the key contributions 
of the paper: 

• Infrastructure failures and gossip communication: The inter-sensor communication may be bandwidth
and power constrained and subject to random environmental conditions. For example, the
sensors may share a common wireless medium and, due to competing objectives, the inter-sensor
transmissions may be scheduled by the underlying MAC (Medium Access Control) layer to occur
at random times; in fact, in many situations of interest, the exact (randomized) MAC protocol
is not known or determined a priori, the inter-sensor communications are asynchronous,
and random data packet dropouts may occur.

1 The term environment or field has a generic usage here. It may correspond to sensors deployed over a domain of interest
like a temperature surface, or a networked physical system instrumented with sensors. Typical examples of the latter include
cyberphysical systems like the power grid, and networked control systems (NCS), where a network of distributed actuators is
equipped with sensors.

• Distributed observability: It is well known that centralized estimation requires observability conditions
to be satisfied for the estimation task to be successful$^2$. As we will see, formulating a satisfactory
notion of distributed observability is not trivial. A difficulty stems from the distributed nature of
the information, i.e., sensors observe only a portion of the field of interest. The incorporation of
estimate fusion among the sensor nodes (consensus) together with local innovation updates suggests
that distributed observability should be not only a function of the sensor observations, but closely
tied to the structural properties of the communication network governing the information flow. These
conditions are sensitive to the pattern of information dissemination in the network and depend on
the level of node cooperation, for example, gossiping. We present minimal conditions for distributed
observability. For example, in the case of full cooperation (each node exchanges its entire
estimate with its neighbors), we show that global observability$^3$ and mean connectedness of the time
varying communication graph are sufficient to ensure consistent parameter estimates at each sensor.

• Distributed versus optimal centralized estimation: We show that under reasonable assumptions, 
the gossip distributed estimators we develop, like the centralized optimal estimator, lead to consistent 
parameter estimates at each sensor. The natural question of interest is to compare the rate of 
convergence of these schemes to the true parameter value. We adopt asymptotic normality and 
the associated asymptotic variance as the metric for comparing different estimators. It is known 
from the theory of recursive estimation (centralized), that the optimum centralized estimator (under 
reasonable assumptions) achieves asymptotic variance equal to the Fisher information rate. In this 
paper, we formalize a notion of distributed Fisher information rate, i.e., a lower bound on the 
asymptotic variance of all distributed schemes and also investigate the existence of optimal distributed 
estimators achieving this lower bound. It turns out that, if the inter-sensor communication is noisy or 
quantized, the asymptotic variance of distributed estimators is always higher than their centralized 
counterpart. On the other hand, a remarkable asymptotic time scale separation phenomenon shows 
that, in the absence of channel noise or quantization (but in the presence of random link failures and
gossip), there exist distributed estimation schemes whose asymptotic variance equals the centralized
Fisher information rate under pragmatic conditions. In particular, it is shown that, in a Gaussian 
environment, a distributed estimator is equivalent to a centralized one in terms of asymptotic variance, 
and, more generally, equivalent to the best linear centralized estimator. This is significant, as it shows 
that, under reasonable assumptions, a distributed gossip estimator is as good as a centralized one, the 
latter having access to all sensor observations at all times. We present some intuitive remarks. In a
centralized recursive (parameter) estimation scheme, the estimate update rule involves combining the
past estimate with the new innovation (observation), the key design parameter being the time varying
gain or weight associated with the innovation term. Since the observations are noisy, for parameter
estimation this weight sequence needs to go to zero to achieve convergence and, in fact, needs
to be square summable to constrain the effect of the observation noise. In most cases, assuming
independent observations over time, the innovation gains decrease as $1/i$ ($i$ being the iteration or
time index) for optimal estimation performance. This means that the estimation uncertainty cannot
be reduced faster than a limiting rate, a consequence of central limit theorem type arguments. Now, consider
the distributed scheme. Here, the algorithm design involves two gain sequences, one for the local
innovations at each sensor and the other for estimate fusion (consensus) across sensors. To design
distributed gossip estimators with good performance, the trick is in choosing the fusion or consensus
gain properly, so that its effect decays at a slower rate than the innovation gain. In the absence of
quantization or channel noise, it is possible to choose the consensus weight sequence such that its
squared sum goes to $\infty$, in contrast to the innovation weight sequence, whose squared sum needs to
be finite. It is shown that this tuning of the different gain sequences leads to an asymptotic time scale
separation, the rate of information dissemination dominating the rate of reduction of uncertainty by
observation acquisition. This tuning is not possible in the case of quantized or noisy transmissions,
as each consensus step introduces noise, preventing proper adjustment of the gain sequences. The
analysis approach that we develop is of independent interest and contributes to the theory of mixed
time scale stochastic approximation.$^4$ Related to our mixed time scale algorithms is the work [4],
which develops methods to analyze such algorithms in the context of simulated annealing. In [4] the
role of our innovation potential is played by a martingale difference term. However, in our paper, an
additional difficulty with respect to [4] is that the innovation is not a martingale difference process,
and so a key step in our analysis is to derive pathwise strong approximation results to characterize
the rate at which the innovation process converges to a martingale difference process.

2 Successful means the estimate sequence generated over time possesses desirable properties like consistency, asymptotic
normality, etc.

3 Global observability corresponds to the centralized setting, where an estimator has access to the observations of all sensors
at all times. The assumption of global observability does not mean that each sensor is observable; rather, that if there were a
centralized estimator with simultaneous access to all the sensor measurements, this centralized estimator would be observable.

Brief review of the literature. We comment on the relevant literature. An early treatment of distributed
stochastic algorithms appears in [5] (see also [6], [7], [8]). In [5], almost sure convergence is established
for a class of distributed stochastic algorithms in the context of distributed optimization. This line of
work assumes the existence of a fixed time window $T$, such that the union of communication graphs
over any interval of length $T$ is connected with probability one. Also, the stochastic noise appears only in
the computation of the local gradients that play the role of innovations in our approach. The conditions
imposed on the local gradients are rather strong and implicitly assume that the individual processor
(sensor in our terminology) dynamics are stable. Some of these conditions are relaxed in [8], which
derives almost sure convergence and asymptotic normality for a class of constrained and unconstrained
parallel and communicating stochastic procedures with perfect communication. In contrast, the gossip
distributed estimators we develop in this paper are general mixed time-scale procedures in generic random
environments and provide pathwise strong convergence rates. Our work does not impose local conditions
on the innovation processes. Instead, it infers connective stability from structural network
conditions and global observability, and establishes strong invariance results relating network information
flow and the effect of local innovations.

4 By mixed time scale, we refer to stochastic algorithms where two potentials act in the same update step with different weight
or gain sequences. This should not be confused with stochastic algorithms with coupling (see [3]), where a quickly switching
parameter influences the relatively slower dynamics of another state, leading to averaged dynamics.

More recently, there has been renewed interest in distributed approaches motivated by wireless sensor 
networks (WSN) applications. The papers [9], [10], [11], [12] study the estimation problem in static 
networks, where either the sensors take a single snapshot of the field at the start and then initiate distributed 
consensus protocols (or more generally distributed optimization, as in [10]) to fuse the initial estimates, 
or the observation rate of the sensors is assumed to be much slower than the inter-sensor communication
rate, thus permitting a separation of the two time-scales. More relevant to our work are [13], [14], [15],
[16], which consider the linear estimation problem in non-random networks, where the observation and 
consensus protocols are incorporated in the same iteration. In [13], [15], the distributed linear estimation 
problems are treated in the context of distributed least-mean-square (LMS) filtering, where constant weight 
sequences are used to prove mean-square stability of the filter. The use of non-decaying combining weights 
in [13], [15], [16] leads to a residual error; however, under appropriate assumptions, these algorithms can 
be adapted for tracking certain time-varying parameters. The distributed LMS algorithm in [14] considers
decaying weight sequences, thereby establishing $\mathcal{L}_2$ convergence to the true parameter value. In contrast
to these, our work quantifies the pathwise information dissemination rate and its relation to the innovation 
rate by studying general mixed time-scale procedures. We consider structural conditions based on the 
network topology and observation pattern to develop a satisfactory notion of distributed observability and 
provide fundamental limits on the performance of distributed schemes. 

The key difference between the current paper and the linear algorithm CU in [1] involves the use of
different weight sequences for the consensus and the innovation terms, giving the linear distributed
estimators here a mixed time scale behavior. On the other hand, in this paper, we assume unquantized
transmissions in the distributed gossip estimators. Another difference that will be noted below is the
incorporation of a general matrix gain $K$ into the innovation update. These modifications make the
technical analysis of the distributed gossip linear estimators in this paper highly non-trivial and very
distinct from the analysis of CU in [1].

We briefly comment on the organization of the rest of the paper. Section I-B sets up notation and 
preliminary concepts to be used throughout the paper. Section II formulates the distributed estimation 
problem, introduces the algorithm QCU and the assumptions (Section II-A). Some technical results on
the convergence of stochastic recurrences are established in Section III. This section also considers some 
properties of centralized estimators, with which we compare our distributed scheme. The main results 
of the paper are stated in Section IV. Section V develops convergence properties of the QCU algorithm, 
leading to the proofs of the main theorems in Section VI. Finally, Section VII concludes the paper. 
B. Notation

We denote the $k$-dimensional Euclidean space by $\mathbb{R}^k$. The set of $m \times n$ matrices with real entries
is denoted by $\mathbb{R}^{m \times n}$. $\mathbb{S}^N$, $\mathbb{S}^N_{+}$, $\mathbb{S}^N_{++}$ refer to the subsets of symmetric, positive semidefinite, positive
definite matrices in $\mathbb{R}^{N \times N}$ respectively. The $k \times k$ identity matrix is denoted by $I_k$, while $\mathbf{1}_k$, $\mathbf{0}_k$ denote
respectively the column vectors of ones and zeros in $\mathbb{R}^k$. The set of integers is denoted by $\mathbb{T}$, whereas $\mathbb{N}$
stands for the natural numbers. $\mathbb{T}_+$ denotes the set of nonnegative integers and indexes the iteration time
slots throughout the paper.

Define the rank one $k \times k$ matrix $P_k$ by

$$P_k = \frac{1}{k}\mathbf{1}_k\mathbf{1}_k^T \tag{1}$$

The only non-zero eigenvalue of $P_k$ is one, and the corresponding normalized eigenvector is $(1/\sqrt{k})\mathbf{1}_k$.

The operator $\|\cdot\|$ applied to a vector denotes the standard Euclidean 2-norm, while applied to matrices
it denotes the induced 2-norm, which is equivalent to the matrix spectral radius for symmetric matrices.

We assume that the parameter to be estimated belongs to a subset $\mathcal{U}$ of the Euclidean space $\mathbb{R}^M$.
Throughout the paper, the true (but unknown) value of the parameter is denoted by $\theta^*$. We denote a
canonical element of $\mathcal{U}$ by $\theta$. The estimate of $\theta^*$ at time $i$ at sensor $n$ is denoted by $x_n(i) \in \mathbb{R}^{M \times 1}$.
Without loss of generality, we assume that the initial estimate, $x_n(0)$, at time 0 at sensor $n$ is a non-random
quantity.

Throughout, we assume that all the random objects are defined on a common measurable space, $(\Omega, \mathcal{F})$.
In case the true (but unknown) parameter value is $\theta^*$, the probability and expectation operators are denoted
by $\mathbb{P}_{\theta^*}[\cdot]$ and $\mathbb{E}_{\theta^*}[\cdot]$, respectively. When the context is clear, we abuse notation by dropping the subscript.
Also, all inequalities involving random variables are to be interpreted a.s. (almost surely).

Spectral graph theory. We review elementary concepts from spectral graph theory. For an undirected
graph $G = (V, E)$, $V = [1 \cdots N]$ is the set of nodes or vertices, $|V| = N$, and $E$ is the set of edges,
$|E| = M$, where $|\cdot|$ is the cardinality. The unordered pair $(n, l) \in E$ if there exists an edge between
nodes $n$ and $l$. We only consider simple graphs, i.e., graphs devoid of self-loops and multiple edges. A
graph is connected if there exists a path$^5$ between each pair of nodes. The neighborhood of node $n$ is

$$\Omega_n = \{l \in V \mid (n, l) \in E\} \tag{2}$$

Node $n$ has degree $d_n = |\Omega_n|$ (number of edges with $n$ as one end point). The structure of the graph
is described by the symmetric $N \times N$ adjacency matrix, $A = [A_{nl}]$, $A_{nl} = 1$ if $(n, l) \in E$, $A_{nl} = 0$
otherwise. The degree matrix is the diagonal matrix $D = \mathrm{diag}(d_1 \cdots d_N)$. The graph positive semi-definite
Laplacian matrix, $L$, and its ordered eigenvalues are

$$L = D - A \tag{3}$$
$$0 = \lambda_1(L) \le \lambda_2(L) \le \cdots \le \lambda_N(L) \tag{4}$$

The smallest eigenvalue $\lambda_1(L)$ is always equal to zero, with $(1/\sqrt{N})\mathbf{1}_N$ being the corresponding
normalized eigenvector. The multiplicity of the zero eigenvalue equals the number of connected components
of the network; for a connected graph, $\lambda_2(L) > 0$. This second eigenvalue is the algebraic connectivity
or the Fiedler value of the network; see [17], [18], [19] for detailed treatment of graphs and their spectral
theory.

5 A path between nodes $n$ and $l$ of length $m$ is a sequence $(n = i_0, i_1, \cdots, i_m = l)$ of vertices, such that $(i_k, i_{k+1}) \in E$
$\forall\, 0 \le k \le m-1$.
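The Laplacian construction in (3) can be illustrated with a small sketch. It assumes a 4-node ring graph chosen purely for illustration; the helper names `laplacian` and `quadratic_form` are hypothetical and not part of the paper. It checks two standard facts: the all-ones vector lies in the null space of $L$, and $x^T L x$ equals the sum of squared differences across edges, which is why $L$ is positive semi-definite.

```python
def laplacian(num_nodes, edges):
    """Return L = D - A (as a list of rows) for an undirected simple graph."""
    L = [[0] * num_nodes for _ in range(num_nodes)]
    for n, l in edges:
        L[n][l] -= 1          # off-diagonal entries: -A
        L[l][n] -= 1
        L[n][n] += 1          # diagonal entries: node degrees
        L[l][l] += 1
    return L


def quadratic_form(L, x):
    """Compute x^T L x."""
    size = len(x)
    return sum(x[n] * L[n][l] * x[l] for n in range(size) for l in range(size))


# Illustrative 4-node ring (an assumption for this example only).
ring_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
L = laplacian(4, ring_edges)

row_sums = [sum(row) for row in L]                      # L * 1 = 0
x = [1.0, -2.0, 0.5, 3.0]
qf = quadratic_form(L, x)                               # x^T L x
edge_sum = sum((x[n] - x[l]) ** 2 for n, l in ring_edges)
```

Both `qf` and `edge_sum` evaluate the same nonnegative quantity, so a vector with equal entries on every connected component gives zero, consistent with the multiplicity statement above.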

Kronecker product: Since we are dealing with vector parameters, most of the matrix manipulations
will involve Kronecker products. For example, the Kronecker product of the $N \times N$ matrix $L$ and $I_M$ will
be an $NM \times NM$ matrix, denoted by $L \otimes I_M$. Denote the $NM \times NM$ matrix
$P_{NM} = P_N \otimes I_M = \frac{1}{N}(\mathbf{1}_N \otimes I_M)(\mathbf{1}_N \otimes I_M)^T$.
We will deal often with matrices of the form $C = [I_{NM} - bL \otimes I_M - aI_{NM} - P_{NM}]$,
$L$ being a graph Laplacian matrix. It follows from the properties of Kronecker products and the matrices
$L$, $P_{NM}$, that the eigenvalues of this matrix $C$ are $-a$ and $1 - b\lambda_i(L) - a$, $2 \le i \le N$, each being
repeated $M$ times.
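A brief derivation of this spectrum may help. It is a sketch under the standard assumption that the symmetric matrix $L$ has orthonormal eigenvectors $v_1 = \mathbf{1}_N/\sqrt{N}, v_2, \ldots, v_N$ associated with $\lambda_1(L) = 0, \lambda_2(L), \ldots, \lambda_N(L)$; the symbols $v_i$ and $w$ are introduced here only for illustration. For any $w \in \mathbb{R}^M$, using $P_N v_1 = v_1$ and $P_N v_i = 0$ for $i \ge 2$:

```latex
\begin{align*}
C\,(v_1 \otimes w)
  &= (v_1 \otimes w) - b\,(L v_1 \otimes w) - a\,(v_1 \otimes w) - (P_N v_1 \otimes w)
   = -a\,(v_1 \otimes w), \\
C\,(v_i \otimes w)
  &= (v_i \otimes w) - b\,\lambda_i(L)\,(v_i \otimes w) - a\,(v_i \otimes w) - 0
   = \bigl(1 - b\lambda_i(L) - a\bigr)(v_i \otimes w), \qquad 2 \le i \le N.
\end{align*}
```

Since $w$ ranges over an $M$-dimensional space, each eigenvalue appears with multiplicity $M$.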

II. Problem Formulation 

Let $\theta^* \in \mathbb{R}^{M \times 1}$ be an $M$-dimensional parameter that is to be estimated by a network of $N$ sensors.
We refer to $\theta$ as a parameter, although it is a vector of $M$ parameters. Each sensor makes independent
observations of noise corrupted linear functions of the parameter. We assume the following observation
model for the $n$-th sensor:

$$z_n(i) = H_n\theta^* + \gamma(i)\zeta_n(i) \tag{5}$$

where: $\{z_n(i) \in \mathbb{R}^{M_n \times 1}\}_{i \ge 0}$ is the independent observation sequence for the $n$-th sensor; $\{\zeta_n(i)\}_{i \ge 0}$ is a
zero-mean i.i.d. noise sequence of bounded variance. For most practical sensor network applications, each
sensor observes only a subset of $M_n$ of the components of $\theta$, with $M_n \ll M$. In such a situation, in
isolation, each sensor can estimate at most only a part of the parameter. However, if the sensor network
is connected in the mean sense (see assumption (A.3)), and under appropriate observability conditions,
we will show that it is possible for each sensor to get a consistent estimate of the parameter $\theta^*$ by means
of local inter-sensor communication.

We formalize the assumptions on global observability, fading signal characteristics and network con- 
nectivity: 

• (A.1) Observation Noise: Recall the observation model in eqn. (5). We assume that the process
$\{\zeta(i) = [\zeta_1^T(i), \cdots, \zeta_N^T(i)]^T\}_{i \ge 0}$ is an i.i.d. zero mean process, with finite second moment. The
observation noise process, $\{\gamma(i)\zeta(i)\}_{i \ge 0}$, then has non-stationary (in general) characteristics, with
variance increasing as $\gamma^2(i)$ over time. The non-decreasing sequence $\{\gamma(i)\}$ models the fading
characteristics of the parameter (signal) over time. In particular, the regime $\gamma(i) \to \infty$ corresponds
to the SNR decreasing as $1/\gamma^2(i)$ over time, whereas $\gamma(i) = 1$ for all $i$ recovers the case of
i.i.d. (constant SNR) observations. Also, note that the observation noises at different sensors may
be correlated during a particular iteration; we require only temporal independence. The spatial
correlation of the observation noise makes our model applicable to practical sensor network problems,
for instance, for distributed target localization, where the observation noise is generally correlated
across sensors.

The following assumption on the growth rate of $\{\gamma(i)\}$ is imposed throughout:
There exists $0 \le \gamma_0 < .5$, such that

$$\gamma(i) = (i+1)^{\gamma_0}, \quad \forall i \in \mathbb{T}_+ \tag{6}$$

In other words, we assume that the observation noise variance has sublinear growth. The sublinear
growth assumption is not restrictive and, as shown in Remark 8, is in fact necessary for centralized
estimators to yield consistent estimates of the parameter.

• (A.2) Observability: We require the following global observability condition. The matrix $G$

$$G = \sum_{n=1}^{N} H_n^T H_n \tag{7}$$

is full-rank. This distributed observability extends the observability condition for a centralized
estimator to get a consistent estimate of the parameter $\theta^*$.

• (A.3) Random Link Failure: In digital communications, packets may be lost at random times. To
account for this, we allow the links (or communication channels among sensors) to fail, so that the
edge set and the connectivity graph of the sensor network are time varying. Accordingly, the sensor
network at time $i$ is modeled as an undirected graph, $G(i) = (V, E(i))$, and the graph Laplacians as
a sequence of i.i.d. Laplacian matrices $\{L(i)\}_{i \ge 0}$. We write

$$L(i) = \bar{L} + \tilde{L}(i), \quad \forall i \ge 0 \tag{8}$$

where the mean $\bar{L} = \mathbb{E}[L(i)]$. We do not make any distributional assumptions on the link failure
model. Although the link failures, and so the Laplacians, are independent at different times, during the
same iteration, the link failures can be spatially dependent, i.e., correlated. This is more general and
subsumes the erasure network model, where the link failures are independent over space and time.
Wireless sensor networks motivate this model since interference among the wireless communication
channels correlates the link failures over space, while, over time, it is still reasonable to assume that
the channels are memoryless or independent.

Connectedness of the graph is an important issue. We do not require that the random instantiations
$G(i)$ of the graph be connected; in fact, it is possible for all these instantiations to
be disconnected. We only require that the graph stays connected on average. This is captured by
requiring that $\lambda_2(\bar{L}) > 0$, enabling us to capture a broad class of asynchronous communication
models; for example, the random asynchronous gossip protocol analyzed in [20] satisfies $\lambda_2(\bar{L}) > 0$
and hence falls under this framework.

• (A.4) Independence Assumptions: The sequences $\{L(i)\}_{i \in \mathbb{T}_+}$ and $\{\zeta(i)\}_{i \in \mathbb{T}_+}$ are mutually
independent.

In Section II-A, we present the algorithm QCU for distributed parameter estimation with the linear
observation model (5). Starting from some initial deterministic estimate of the parameters (the initial
states may be random; we assume deterministic for notational simplicity), $x_n(0) \in \mathbb{R}^{M \times 1}$, each sensor
generates by a distributed iterative algorithm a sequence of estimates, $\{x_n(i)\}_{i \ge 0}$. The parameter estimate
$x_n(i+1)$ at the $n$-th sensor at time $i+1$ is a function of: its previous estimate; the communicated estimates
at time $i$ of its neighboring sensors; and the new observation $z_n(i)$.

A. Algorithm QCU 

Algorithm QCU: Consider the parameter estimation problem with linear observation model (assumptions
(A.1)-(A.2)). Let $x(0) = [x_1(0)^T, \cdots, x_N(0)^T]^T$ be the initial estimates of $\theta^*$ at the sensors. The
QCU algorithm updates the estimate $x_n(i)$ at sensor $n$ according to the following:

$$x_n(i+1) = x_n(i) - \beta(i)\sum_{l \in \Omega_n(i)} \left(x_n(i) - x_l(i)\right) + \alpha(i) K H_n^T \left(z_n(i) - H_n x_n(i)\right) \tag{9}$$

The key difference between the above scheme and the CU in [1] involves the use of different weight 
sequences for the consensus and the innovation terms, giving the former a mixed time scale behavior. On 
the other hand, we assume unquantized transmissions in QCU. Another difference is the incorporation 
of a general matrix gain K into the innovation update. These modifications make the technical analysis 
of QCU highly non-trivial and different from that of CU, mostly due to the incorporation of mixed time 
scale dynamics. 
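To make the structure of update (9) concrete, the following is a minimal simulation sketch rather than the authors' implementation. It makes several simplifying assumptions that are not in the text: a scalar parameter ($M = 1$), scalar gain $K = 1$, $\gamma(i) = 1$, Gaussian noise, a 5-node ring whose links fail independently with a fixed probability, and polynomially decaying weights $\alpha(i) = a/(i+1)^{\tau_1}$, $\beta(i) = b/(i+1)^{\tau_2}$ with illustrative constants. The function name `simulate_qcu` and all default values are hypothetical.

```python
import random


def simulate_qcu(theta_star, H, edges, steps, a=1.0, tau1=0.8, b=0.25, tau2=0.2,
                 K=1.0, link_prob=0.7, noise_std=0.1, seed=0):
    """Scalar-parameter sketch of update (9) with i.i.d. random link failures."""
    rng = random.Random(seed)
    N = len(H)
    x = [0.0] * N                                  # deterministic initial estimates
    for i in range(steps):
        alpha = a / (i + 1) ** tau1                # innovation weight
        beta = b / (i + 1) ** tau2                 # consensus weight (decays more slowly)
        # Each undirected link is up independently with probability link_prob.
        active = [e for e in edges if rng.random() < link_prob]
        consensus = [0.0] * N
        for n, l in active:
            consensus[n] += x[n] - x[l]            # sum over current neighbors of (x_n - x_l)
            consensus[l] += x[l] - x[n]
        x_next = []
        for n in range(N):
            z = H[n] * theta_star + rng.gauss(0.0, noise_std)   # observation model (5)
            innovation = K * H[n] * (z - H[n] * x[n])
            x_next.append(x[n] - beta * consensus[n] + alpha * innovation)
        x = x_next
    return x


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
H = [1.0, 0.0, 1.0, 0.0, 0.0]     # only sensors 0 and 2 sense the parameter
estimates = simulate_qcu(theta_star=2.0, H=H, edges=ring, steps=20000)
```

In this toy setting three of the five sensors never observe the parameter, yet their estimates are pulled toward the true value through the consensus term alone, which is the behavior that global observability plus mean connectedness is meant to guarantee.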

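In a compact notation, QCU may be written as:

$$x(i+1) = x(i) - \beta(i)\left(L(i) \otimes I_M\right)x(i) + \alpha(i)\left(I_N \otimes K\right) D_H \left(z(i) - D_H^T x(i)\right) \tag{10}$$

where $x(i) = [x_1^T(i), \cdots, x_N^T(i)]^T$ and $z(i) = [z_1^T(i), \cdots, z_N^T(i)]^T$ stack the sensor estimates and
observations, and $D_H = \mathrm{diag}[H_1^T, \cdots, H_N^T]$ is the block diagonal matrix of transposed observation matrices.

We refer to the class of distributed recursive estimation algorithms in (9) as QCU. As will be shown,
different choices of the weight sequences $\{\alpha(i)\}$, $\{\beta(i)\}$ lead to different convergence characteristics of
QCU, hence the usage of the term 'class of algorithms'. In the following, we introduce some additional
moment requirements and assumptions on the algorithm weight sequences: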

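• (A.5) Moment Condition: There exists $\varepsilon_1 > 0$, such that the following moment exists:

$$\mathbb{E}_\theta\left[\|\zeta(i)\|^{2+\varepsilon_1}\right] < \infty \tag{11}$$

The above implies the existence of a positive function $\kappa_1(\theta)$ such that, with the block diagonal matrix
$D_H = \mathrm{diag}[H_1^T, \cdots, H_N^T]$,

$$\mathbb{E}_\theta\left[\left\| D_H z(i) - \mathbf{1}_N \otimes \left(\frac{1}{N}\left(\mathbf{1}_N \otimes I_M\right)^T D_H z(i)\right)\right\|^{2+\varepsilon_1}\right] \le \gamma^{2+\varepsilon_1}(i)\,\kappa_1(\theta) < \infty \tag{12}$$

for all $i \in \mathbb{T}_+$. We thus assume the existence of slightly greater than quadratic moment of the
observation noise process.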

• (A.6) Weight sequences: The sequences $\{\alpha(i)\}$ and $\{\beta(i)\}$ are of the form:

$$\alpha(i) = \frac{a}{(i+1)^{\tau_1}}, \qquad \beta(i) = \frac{b}{(i+1)^{\tau_2}} \tag{13}$$

where $a, b > 0$, $0 < \tau_2 < \tau_1 \le 1$. In addition, the weights satisfy the following condition:

$$\tau_1 > \max\left(.5 + \gamma_0,\ \tau_2 + \gamma_0 + \frac{1}{2+\varepsilon_1}\right) \tag{14}$$

where $\max(\cdot)$ denotes the maximum of $.5 + \gamma_0$ and $\tau_2 + \gamma_0 + \frac{1}{2+\varepsilon_1}$.

The gain matrix $K$ is assumed to be positive definite. To avoid unnecessary technicalities, we also
assume that the matrices $K$ and $G$ commute, so that $KG$ is symmetric positive definite (see [21]).
Recall that $G = \sum_{n=1}^{N} H_n^T H_n$ is the invertible Grammian.

Remark 1 We comment on the QCU assumptions. First, we note that the moment assumption is not
restrictive, and most reasonable noise models possess moments of sufficiently high order. Also, it is easy
to come up with a choice of algorithm parameters $(\tau_1, \tau_2)$ given a $0 \le \gamma_0 < .5$. In fact, any choice
of $\tau_1 > .5 + \gamma_0$ suffices, as one can choose $\tau_2$ satisfying $0 < \tau_2 < \tau_1 - .5 - \gamma_0$. That this choice
satisfies assumption (A.6) (condition (14)) is due to the fact that $\frac{1}{2+\varepsilon_1} < .5$ for any $\varepsilon_1 > 0$. Finally, a note on
nomenclature. Often, we will use the term $(\tau_1, a, \tau_2, b, K)$-QCU algorithm to indicate explicitly the QCU
design parameters in force.
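The exponent constraints of (A.6) and the recipe in Remark 1 can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, and it assumes the reading $0 \le \gamma_0 < .5$ and $0 < \tau_2 < \tau_1 \le 1$ of the constraints, with weights of the form $a/(i+1)^{\tau_1}$ and $b/(i+1)^{\tau_2}$.

```python
def weights_admissible(tau1, tau2, gamma0, eps1):
    """Check the exponent constraints of (A.6), including condition (14)."""
    if not (0.0 <= gamma0 < 0.5 and eps1 > 0.0):
        return False
    if not (0.0 < tau2 < tau1 <= 1.0):
        return False
    return tau1 > max(0.5 + gamma0, tau2 + gamma0 + 1.0 / (2.0 + eps1))


def alpha(i, a, tau1):
    """Innovation weight at iteration i."""
    return a / (i + 1) ** tau1


def beta(i, b, tau2):
    """Consensus weight at iteration i."""
    return b / (i + 1) ** tau2


# Remark 1 recipe: pick tau1 > .5 + gamma0, then 0 < tau2 < tau1 - .5 - gamma0.
ok = weights_admissible(tau1=0.9, tau2=0.2, gamma0=0.1, eps1=0.01)
# tau1 too small relative to the noise growth exponent gamma0.
bad = weights_admissible(tau1=0.55, tau2=0.2, gamma0=0.1, eps1=1.0)
```

With the Remark 1 recipe the check passes even for very small `eps1`, because $\frac{1}{2+\varepsilon_1}$ never reaches $.5$.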



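Markov. Consider the filtration $\{\mathcal{F}_i^x\}_{i \ge 0}$, given by

$$\mathcal{F}_i^x = \sigma\left(x(0), \{L(j), z(j)\}_{0 \le j < i}\right) \tag{15}$$

It then follows that the random objects $L(i)$, $z(i)$ are independent of $\mathcal{F}_i^x$, rendering $\{x(i), \mathcal{F}_i^x\}_{i \ge 0}$ a
Markov process.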



B. Centralized linear estimators 

The key focus of the paper is to compare the performance achieved by the class of QCU algorithms
to centralized estimation schemes$^6$. Specifically, we will restrict this comparison to linear centralized
estimators only. To this end, we start by defining a reasonable (to be clear soon) class of centralized
linear estimators$^7$ of the parameter $\theta$.

6 A centralized scheme corresponds to a fusion center having access to all sensor observations at all times.

Definition 2 (Centralized linear estimator) A centralized linear estimator is a process $\{u(i)\}_{i \in \mathbb{T}_+}$ evolving
as

$$u(i+1) = u(i) + \frac{\alpha_c(i)}{N} K_c \sum_{n=1}^{N} \left(H_n^T z_n(i) - H_n^T H_n u(i)\right) \tag{16}$$

Here, we assume that the weight sequence $\{\alpha_c(i)\}$ is of the form

$$\alpha_c(i) = \frac{a_c}{(i+1)^{\tau_c}} \tag{17}$$

for some $a_c > 0$ and $\tau_c > 0$. Also, $K_c$ is a positive definite gain matrix that commutes with the Grammian
$G$.

A centralized linear estimator is called good, if in addition the design parameter satisfies

$$.5 + \gamma_0 < \tau_c \le 1 \tag{18}$$

Remark 3 We comment on the above definition and justify the nomenclature good. Clearly, different
choices of the gain matrix $K_c$ and the weight sequence $\{\alpha_c(i)\}$ would lead to different convergence
properties of the estimator $\{u(i)\}$. As shown in Proposition 7, the condition $.5 + \gamma_0 < \tau_c \le 1$ is
necessary and sufficient for the estimator $\{u(i)\}$ to be universally$^8$ consistent from all initial conditions.
In particular, the best linear centralized estimator assumes the form in Definition 2 (for a specific choice
of $K_c$ and $\{\alpha_c(i)\}$). Hence, for all purposes, it is sufficient to compare the distributed algorithm QCU
with the class of good centralized estimators defined above. In the following, we will restrict attention to
good centralized estimators only, and will often drop the term good when referring to these estimators.
Also, similar to the distributed QCU estimators, we will use the term $(\tau_c, a_c, K_c)$ centralized estimator
to indicate explicitly the design parameters in force.
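As a point of comparison for the distributed scheme, the centralized recursion in Definition 2 can be sketched as follows. This is an illustrative sketch, not the authors' code. It assumes $K_c = I$, $\gamma(i) = 1$, Gaussian noise, scalar observations at each sensor, and a $1/N$ normalization of the summed innovations; the function name `centralized_estimate` and the numerical values are hypothetical.

```python
import random


def centralized_estimate(theta_star, H, steps, a_c, tau_c, noise_std=0.5, seed=1):
    """Sketch of a centralized linear recursion with K_c = I (an assumption).

    H[n] is the row vector through which sensor n observes the parameter.
    """
    rng = random.Random(seed)
    N, M = len(H), len(theta_star)
    u = [0.0] * M
    for i in range(steps):
        alpha_c = a_c / (i + 1) ** tau_c
        correction = [0.0] * M
        for h in H:
            z = sum(hj * tj for hj, tj in zip(h, theta_star)) + rng.gauss(0.0, noise_std)
            residual = z - sum(hj * uj for hj, uj in zip(h, u))
            for j in range(M):
                correction[j] += h[j] * residual       # H_n^T (z_n - H_n u)
        # Assumed 1/N normalization of the summed innovations.
        u = [uj + (alpha_c / N) * cj for uj, cj in zip(u, correction)]
    return u


# Three sensors, each seeing one coordinate; G = diag(2, 1) is full rank.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
u = centralized_estimate([1.0, -3.0], H, steps=5000, a_c=3.0, tau_c=1.0)
```

Here $\tau_c = 1$ lies in the "good" range when $\gamma_0 = 0$, and no single sensor could recover both coordinates on its own.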

Before proceeding to the convergence analysis of QCU under assumptions (A.1)-(A.6), we establish 
some properties of general stochastic recursions to be used in the sequel. 

III. Some Intermediate Results 

We establish three approximation results to be used later. The first one (Lemma 4) is a stochastic 
analogue of Lemma 18 in [1], the second one (Lemma 5) quantifies the pathwise convergence rate in 
Lemma 4. Lemma 6 is a time-varying mixed time scale version of Lemma 3 in [1]. Finally, we end this 
section by listing some convergence properties of the centralized estimators (Definition 2.) 



7 Since we deal with linear centralized estimators only, in the following we drop the term linear when referring to centralized
estimators.

8 By universal consistency of an algorithm, we mean that the algorithm leads to consistent estimates of the parameter $\theta$
irrespective of the observation noise distribution, as long as the moment assumption (A.5) is satisfied.
Lemma 4 Consider the scalar time-varying linear system:

$$y(i+1) = (1 - r_1(i))\, y(i) + r_2(i) \tag{19}$$

Here $\{r_1(i)\}$ is a sequence of independent random variables such that $0 \leq r_1(i) \leq 1$ a.s. with mean

$$\mathbb{E}[r_1(i)] = \frac{a_1}{(i+1)^{\delta_1}} \tag{20}$$

and $a_1 > 0$, $0 \leq \delta_1 \leq 1$. Also, assume $y(0) \geq 0$ and the sequence $\{r_2(i)\}$ is given by

$$r_2(i) = \frac{a_2}{(i+1)^{\delta_2}} \tag{21}$$

where $a_2 > 0$, $\delta_2 > 0$. Then, if $\delta_1 < \delta_2$,

$$\lim_{i \to \infty} y(i) = 0 \quad \text{a.s.} \tag{22}$$



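Proof: The assumptions imply that the sequence $\{y(i)\}$ is non-negative. Define the process $\{V_1(i)\}$ by

$$V_1(i) = y(i) - \sum_{k=0}^{i-1} \left[ \left( \prod_{l=k+1}^{i-1} \left(1 - \mathbb{E}[r_1(l)]\right) \right) r_2(k) \right] \tag{23}$$

Since $\delta_1 < \delta_2$, an application of Lemma 18 in [1] yields

$$\lim_{i \to \infty} \sum_{k=0}^{i-1} \left[ \left( \prod_{l=k+1}^{i-1} \left(1 - \mathbb{E}[r_1(l)]\right) \right) r_2(k) \right] = 0 \tag{24}$$

Hence, in particular, the second term on the R.H.S. of (23) is bounded and $\{V_1(i)\}$ is well defined. Denote by
$\{\mathcal{F}_i^{y}\}$ the natural filtration of the process $\{y(i)\}$ and note that $\{V_1(i)\}$ is adapted to this filtration.

Using the fact that

$$\sum_{k=0}^{i} \left[ \left( \prod_{l=k+1}^{i} \left(1 - \mathbb{E}[r_1(l)]\right) \right) r_2(k) \right] = \sum_{k=0}^{i-1} \left[ \left( \prod_{l=k+1}^{i-1} \left(1 - \mathbb{E}[r_1(l)]\right) \right) r_2(k) \right] + r_2(i) \tag{25}$$

we have, by the independence condition,

$$\begin{aligned}
\mathbb{E}\left[V_1(i+1) \mid \mathcal{F}_i^{y}\right] &= \mathbb{E}\left[y(i+1) \mid \mathcal{F}_i^{y}\right] - \sum_{k=0}^{i} \left[ \left( \prod_{l=k+1}^{i} \left(1 - \mathbb{E}[r_1(l)]\right) \right) r_2(k) \right] \\
&= \left(1 - \mathbb{E}[r_1(i)]\right) y(i) + r_2(i) - \sum_{k=0}^{i} \left[ \left( \prod_{l=k+1}^{i} \left(1 - \mathbb{E}[r_1(l)]\right) \right) r_2(k) \right] \\
&= \left(1 - \mathbb{E}[r_1(i)]\right) y(i) - \sum_{k=0}^{i-1} \left[ \left( \prod_{l=k+1}^{i-1} \left(1 - \mathbb{E}[r_1(l)]\right) \right) r_2(k) \right] \\
&= V_1(i) - \mathbb{E}[r_1(i)]\, y(i)
\end{aligned} \tag{26}$$

The nonnegativity of $\{y(i)\}$ implies

$$\mathbb{E}\left[V_1(i+1) \mid \mathcal{F}_i^{y}\right] \leq V_1(i) \tag{27}$$

Hence $\{V_1(i)\}$ is a supermartingale. The nonnegativity of $\{y(i)\}$ and the boundedness of the terms
$\sum_{k=0}^{i-1} \left[ \left( \prod_{l=k+1}^{i-1} (1 - \mathbb{E}[r_1(l)]) \right) r_2(k) \right]$ for all $i$ show that $\{V_1(i)\}$ is bounded from below. It then follows
that there exists a finite random variable $V_1^*$ such that

$$\lim_{i \to \infty} V_1(i) = V_1^* \quad \text{a.s.} \tag{28}$$

We then have

$$\lim_{i \to \infty} y(i) = \lim_{i \to \infty} V_1(i) + \lim_{i \to \infty} \sum_{k=0}^{i-1} \left[ \left( \prod_{l=k+1}^{i-1} \left(1 - \mathbb{E}[r_1(l)]\right) \right) r_2(k) \right] = V_1^* \tag{29}$$

Since $y(0)$ is deterministic, the sequence $\{y(i)\}$ is integrable and we have

$$\mathbb{E}[y(i)] = \left( \prod_{k=0}^{i-1} \left(1 - \mathbb{E}[r_1(k)]\right) \right) y(0) + \sum_{k=0}^{i-1} \left[ \left( \prod_{l=k+1}^{i-1} \left(1 - \mathbb{E}[r_1(l)]\right) \right) r_2(k) \right] \tag{30}$$

An application of Lemma 18 in [1] then shows

$$\lim_{i \to \infty} \mathbb{E}[y(i)] = 0 \tag{31}$$

and by Fatou's lemma we conclude $\mathbb{E}[V_1^*] = 0$. Since $V_1^*$ is nonnegative, being the limit of the
nonnegative sequence $\{y(i)\}$, we have

$$V_1^* = 0 \quad \text{a.s.} \tag{32}$$

and the claim holds. ■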
We will also use the following result, which characterizes the convergence rate in the above. The proof
is somewhat similar to the arguments in Lemma 4 and we omit it due to space limitations.

Lemma 5 Consider the scalar deterministic time-varying linear system:

$$y(i+1) = (1 - r_1(i))\, y(i) + r_2(i) \tag{33}$$

where the sequences $\{r_1(i)\}$ and $\{r_2(i)\}$ satisfy the hypothesis of Lemma 4.

• (1) Then, if $\delta_1 < \delta_2$ and $\delta_1 < 1$,

$$\lim_{i \to \infty} (i+1)^{\delta_0}\, y(i) = 0 \tag{34}$$

for all $0 \leq \delta_0 < \delta_2 - \delta_1$.

• (2) Let $\delta_1 < \delta_2$ and $\delta_1 = 1$. Then the above conclusion holds if, in addition, $a_1 > \delta_0$.

• (3) All the above remain valid when $r_1(i)$ is random satisfying the conditions of Lemma 4.
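The rate statement of Lemma 5 can be checked numerically with a minimal sketch of the deterministic recursion. The function name `scaled_tail` and the specific constants used in the usage note are illustrative assumptions, not values taken from the paper.

```python
def scaled_tail(a1, d1, a2, d2, d0, y0=1.0, steps=200000):
    """Iterate y(i+1) = (1 - r1(i)) y(i) + r2(i) and return (steps+1)**d0 * y(steps).

    r1(i) = a1 / (i+1)**d1 and r2(i) = a2 / (i+1)**d2 (deterministic case).
    Choose a1 <= 1 so that 0 <= r1(i) <= 1 holds for every i.
    """
    y = y0
    for i in range(steps):
        r1 = a1 / (i + 1) ** d1
        r2 = a2 / (i + 1) ** d2
        y = (1.0 - r1) * y + r2
    return (steps + 1) ** d0 * y
```

For example, with `a1=0.5, d1=0.5, a2=1.0, d2=1.0` the gap $\delta_2 - \delta_1$ is 0.5, so for `d0=0.25` the scaled value should keep shrinking as `steps` grows, in line with the first assertion of the lemma.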
Lemma 6 Under the stated assumptions, there exist $i_1$ sufficiently large and a constant $c_4 > 0$ such
that, for $i \geq i_1$,

$$\mathbf{y}^T \left( \beta(i) \overline{L} \otimes I + \alpha(i) (I_N \otimes K) D_H \right) \mathbf{y} \geq c_4\, \alpha(i) \|\mathbf{y}\|^2, \quad \forall\, \mathbf{y} \in \mathbb{R}^{NM} \tag{35}$$

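Proof: The key difference from the proof of Lemma 3 in [1] is that the matrix $\left( \beta(i) \overline{L} \otimes I + \alpha(i) (I_N \otimes K) D_H \right)$
is not symmetric. We first show that the quadratic form

$$\mathbf{y}^T \left( \frac{\beta(i)}{\alpha(i)} \overline{L} \otimes I + (I_N \otimes K) D_H \right) \mathbf{y} \tag{36}$$

is strictly greater than zero for all $\mathbf{y} \in \mathbb{R}^{NM}$ satisfying $\|\mathbf{y}\| = 1$ for all sufficiently large $i$. To this end,
for such $\mathbf{y}$, consider the decomposition

$$\mathbf{y} = \mathbf{y}_{\mathcal{C}} + \mathbf{y}_{\mathcal{C}^\perp} \tag{37}$$

Define the symmetric matrix $D_K$ by

$$D_K = \frac{1}{2} \left[ (I_N \otimes K) D_H \right] + \frac{1}{2} \left[ (I_N \otimes K) D_H \right]^T \tag{38}$$

Noting that

$$\mathbf{y}^T \left[ (I_N \otimes K) D_H \right] \mathbf{y} = \mathbf{y}^T D_K \mathbf{y} \tag{39}$$

we have

$$\begin{aligned}
\mathbf{y}^T \left( \frac{\beta(i)}{\alpha(i)} \overline{L} \otimes I + (I_N \otimes K) D_H \right) \mathbf{y} &= \mathbf{y}^T \left( \frac{\beta(i)}{\alpha(i)} \overline{L} \otimes I + D_K \right) \mathbf{y} \\
&= \mathbf{y}^T \left( \frac{\beta(i)}{\alpha(i)} \overline{L} \otimes I \right) \mathbf{y} + \mathbf{y}^T D_K \mathbf{y} \\
&= \mathbf{y}_{\mathcal{C}^\perp}^T \left( \frac{\beta(i)}{\alpha(i)} \overline{L} \otimes I \right) \mathbf{y}_{\mathcal{C}^\perp} + \mathbf{y}_{\mathcal{C}^\perp}^T D_K \mathbf{y}_{\mathcal{C}^\perp} + 2 \mathbf{y}_{\mathcal{C}^\perp}^T D_K \mathbf{y}_{\mathcal{C}} + \mathbf{y}_{\mathcal{C}}^T D_K \mathbf{y}_{\mathcal{C}} \\
&\geq \frac{\beta(i)}{\alpha(i)} \lambda_2(\overline{L}) \|\mathbf{y}_{\mathcal{C}^\perp}\|^2 + \mathbf{y}_{\mathcal{C}^\perp}^T D_K \mathbf{y}_{\mathcal{C}^\perp} + 2 \mathbf{y}_{\mathcal{C}^\perp}^T D_K \mathbf{y}_{\mathcal{C}} + \mathbf{y}_{\mathcal{C}}^T D_K \mathbf{y}_{\mathcal{C}}
\end{aligned} \tag{40}$$

Now, the symmetry of $D_K$ implies the existence of a constant $c_{15} > 0$, large enough, such that

$$\mathbf{y}_{\mathcal{C}^\perp}^T D_K \mathbf{y}_{\mathcal{C}^\perp} \geq -c_{15} \|\mathbf{y}_{\mathcal{C}^\perp}\|^2 \tag{41}$$

$$\mathbf{y}_{\mathcal{C}^\perp}^T D_K \mathbf{y}_{\mathcal{C}} \geq -c_{15} \|\mathbf{y}_{\mathcal{C}}\| \|\mathbf{y}_{\mathcal{C}^\perp}\| \tag{42}$$

Also, using the form $\mathbf{y}_{\mathcal{C}} = \mathbf{1}_N \otimes \mathbf{a}$ for some $\mathbf{a} \in \mathbb{R}^M$, we note that

$$\begin{aligned}
\mathbf{y}_{\mathcal{C}}^T D_K \mathbf{y}_{\mathcal{C}} &= \mathbf{y}_{\mathcal{C}}^T \left[ (I_N \otimes K) D_H \right] \mathbf{y}_{\mathcal{C}} \\
&= \sum_{n=1}^{N} \mathbf{a}^T K H_n^T H_n \mathbf{a} \\
&= \mathbf{a}^T K G \mathbf{a} \\
&\geq \lambda_{\min}(KG) \|\mathbf{a}\|^2 \\
&= \frac{\lambda_{\min}(KG)}{N} \|\mathbf{y}_{\mathcal{C}}\|^2
\end{aligned} \tag{43}$$

where the last but one step uses the fact that the matrix $KG$ is positive definite, as both $K$ and $G$ are
positive definite and they commute. Note, in particular, that $\lambda_{\min}(KG) > 0$. Substituting the above in eqn. (40),
we have

$$\mathbf{y}^T \left( \frac{\beta(i)}{\alpha(i)} \overline{L} \otimes I + (I_N \otimes K) D_H \right) \mathbf{y} \geq \left( \frac{\beta(i)}{\alpha(i)} \lambda_2(\overline{L}) - c_{15} \right) \|\mathbf{y}_{\mathcal{C}^\perp}\|^2 - 2 c_{15} \|\mathbf{y}_{\mathcal{C}}\| \|\mathbf{y}_{\mathcal{C}^\perp}\| + \frac{\lambda_{\min}(KG)}{N} \|\mathbf{y}_{\mathcal{C}}\|^2 \tag{44}$$

Since $\lim_{i \to \infty} \beta(i)/\alpha(i) = \infty$ ($\tau_2 < \tau_1$), we can choose $i_1$ large enough such that, for $i \geq i_1$,

$$\frac{\beta(i)}{\alpha(i)} \lambda_2(\overline{L}) - c_{15} > 0 \tag{45}$$

$$\frac{\lambda_{\min}(KG)}{N} \left( \frac{\beta(i)}{\alpha(i)} \lambda_2(\overline{L}) - c_{15} \right) > c_{15}^2 \tag{46}$$

We now verify the claim in eqn. (36) for $i \geq i_1$. Clearly, if $\mathbf{y}_{\mathcal{C}} = \mathbf{0}$, the quadratic form reduces to

$$\mathbf{y}^T \left( \frac{\beta(i)}{\alpha(i)} \overline{L} \otimes I + (I_N \otimes K) D_H \right) \mathbf{y} \geq \left( \frac{\beta(i)}{\alpha(i)} \lambda_2(\overline{L}) - c_{15} \right) \|\mathbf{y}_{\mathcal{C}^\perp}\|^2 = \frac{\beta(i)}{\alpha(i)} \lambda_2(\overline{L}) - c_{15} > 0 \tag{47}$$

(Note that the constraint that $\mathbf{y}$ lies on the unit circle forces $\|\mathbf{y}_{\mathcal{C}^\perp}\|$ to be 1 if $\mathbf{y}_{\mathcal{C}} = \mathbf{0}$.) On the other
hand, if $\|\mathbf{y}_{\mathcal{C}}\| > 0$, we have

$$\mathbf{y}^T \left( \frac{\beta(i)}{\alpha(i)} \overline{L} \otimes I + (I_N \otimes K) D_H \right) \mathbf{y} \geq \|\mathbf{y}_{\mathcal{C}}\|^2 \left[ \left( \frac{\beta(i)}{\alpha(i)} \lambda_2(\overline{L}) - c_{15} \right) \frac{\|\mathbf{y}_{\mathcal{C}^\perp}\|^2}{\|\mathbf{y}_{\mathcal{C}}\|^2} - 2 c_{15} \frac{\|\mathbf{y}_{\mathcal{C}^\perp}\|}{\|\mathbf{y}_{\mathcal{C}}\|} + \frac{\lambda_{\min}(KG)}{N} \right] \tag{48}$$

The term on the R.H.S. is always strictly greater than zero by the conditions in eqns. (45)-(46), the latter being
the discriminant condition for the quadratic in $\|\mathbf{y}_{\mathcal{C}^\perp}\| / \|\mathbf{y}_{\mathcal{C}}\|$.

The assertion in eqn. (36) thus holds. Since the quadratic form is a continuous function of $\mathbf{y}$, its
positivity on the unit circle implies that there exists $c_4 > 0$ such that

$$\inf_{\|\mathbf{y}\| = 1} \mathbf{y}^T \left( \frac{\beta(i)}{\alpha(i)} \overline{L} \otimes I + (I_N \otimes K) D_H \right) \mathbf{y} \geq c_4 > 0 \tag{49}$$

It then follows that, for all $\mathbf{y} \in \mathbb{R}^{NM}$,

$$\mathbf{y}^T \left( \frac{\beta(i)}{\alpha(i)} \overline{L} \otimes I + (I_N \otimes K) D_H \right) \mathbf{y} \geq c_4 \|\mathbf{y}\|^2 \tag{50}$$

and hence

$$\mathbf{y}^T \left( \beta(i) \overline{L} \otimes I + \alpha(i) (I_N \otimes K) D_H \right) \mathbf{y} = \alpha(i)\, \mathbf{y}^T \left( \frac{\beta(i)}{\alpha(i)} \overline{L} \otimes I + (I_N \otimes K) D_H \right) \mathbf{y} \geq \alpha(i)\, c_4 \|\mathbf{y}\|^2 \tag{51}$$

for $i \geq i_1$. ■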



Note that the condition $\lim_{i \to \infty} \beta(i)/\alpha(i) = \infty$ is required for Lemma 6.
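The role of the ratio $\beta(i)/\alpha(i)$ in Lemma 6 can be seen in a minimal two-sensor sketch with a scalar parameter (N = 2, M = 1). The setup is an illustrative assumption: the mean Laplacian is that of a single edge, and `d1`, `d2` stand for the scalar products $K H_1^2$ and $K H_2^2$, so the matrix is symmetric and the infimum of the quadratic form over the unit circle is its smallest eigenvalue. The helper name `min_quadratic_form` is hypothetical.

```python
import math


def min_quadratic_form(ratio, d1, d2):
    """Smallest eigenvalue of ratio * L + diag(d1, d2) with L = [[1, -1], [-1, 1]].

    `ratio` plays the role of beta(i) / alpha(i); d1, d2 play the role of K * H_n^2.
    Uses the closed form for the eigenvalues of a symmetric 2x2 matrix.
    """
    p = ratio + d1       # top-left entry
    q = ratio + d2       # bottom-right entry
    off = -ratio         # off-diagonal entry
    return 0.5 * ((p + q) - math.sqrt((p - q) ** 2 + 4.0 * off * off))
```

With `d1 = 0` (sensor 1 observes nothing) the form is not strictly positive at `ratio = 0`, becomes positive for any `ratio > 0`, and approaches `(d1 + d2) / 2`, the scalar analogue of $\lambda_{\min}(KG)/N$, as the ratio grows.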

The following proposition justifies the nomenclature good in Definition 2. In particular, it shows that
under assumptions (A.1),(A.2),(A.5), there exists a noise distribution (Gaussian) such that the centralized
scheme is not consistent if $\tau_c$ fails to satisfy the requirement (18).

Proposition 7 (1) Suppose the process $\{\zeta(i)\}$ is Gaussian. Consider the centralized estimator $\{\mathbf{u}(i)\}$.
Then, if $\tau_c < \gamma_0 + .5$ or $\tau_c > 1$, the sequence $\{\mathbf{u}(i)\}$ is not consistent from arbitrary initial condition $\mathbf{u}(0)$.

(2) Let assumptions (A.1),(A.2),(A.5) hold. Then, a good centralized estimator is consistent (universally)
from all initial conditions.

(3) Let assumptions (A.1),(A.2),(A.5) hold. Consider a good centralized estimator with design parameters
$(\tau_c, a_c, K_c)$. Then, there exists a $(\tau_1, a, \tau_2, b, K)$-QCU estimator such that $\tau_1 = \tau_c$, $a = a_c$, $K = K_c$.

Remark 8 As a consequence of the first assertion, we note that, for a centralized linear estimator to
achieve consistency, the parameter $\gamma_0$ should be strictly less than .5.

Proof: Due to space limitations, we omit the proof which follows from standard properties of
stochastic recurrences and approximation ([22]).

We present an intuitive sketch of the proof of the first assertion. From (16), we note that, at time $i$, an
observation noise is incorporated on the right hand side (R.H.S.) with variance of the order $(i+1)^{2\gamma_0 - 2\tau_c}$.
Clearly, if $\tau_c < .5 + \gamma_0$, as $i \to \infty$ the cumulative noise adds up to $\infty$. For Gaussian noise, this would
lead to unboundedness of the estimate sequence $\{\mathbf{u}(i)\}$. This explains the lower bound in the choice of
$\tau_c$. On the other hand, if $\tau_c > 1$, the sequence $\{\alpha_c(i)\}$ becomes summable and the updates die out quickly. Hence,
depending on the initial estimate $\mathbf{u}(0)$, it may not be possible to progress towards $\theta^*$. Thus, in general,
we need $\tau_c \leq 1$.

The second assertion follows from standard stochastic approximation arguments (see, for example [22]
and Theorem 1 in [23].)

The third assertion simply states that there exists a choice of $\tau_2$ satisfying assumption (A.6), when
$\tau_1 = \tau_c$ and $K = K_c$. This is immediate from Remark 1. ■
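The two competing requirements in the intuitive sketch above (non-summable weights, summable injected noise variance) can be made concrete with a small numerical sketch. The helper names `is_good` and `partial_sums` are hypothetical, and partial sums only suggest, rather than prove, divergence or convergence.

```python
def is_good(tau_c, gamma0):
    """Check the design condition .5 + gamma0 < tau_c <= 1 of Definition 2."""
    return 0.5 + gamma0 < tau_c <= 1.0


def partial_sums(tau_c, gamma0, a_c=1.0, steps=100000):
    """Return partial sums of the weights and of the noise-variance order.

    weight_sum adds a_c / (i+1)**tau_c; noise_sum adds (i+1)**(2*gamma0 - 2*tau_c).
    """
    weight_sum = 0.0
    noise_sum = 0.0
    for i in range(steps):
        weight_sum += a_c / (i + 1) ** tau_c
        noise_sum += (i + 1) ** (2.0 * gamma0 - 2.0 * tau_c)
    return weight_sum, noise_sum
```

For instance, `partial_sums(0.6, 0.2)` gives a noise sum that keeps growing with `steps` (the lower bound is violated), whereas `partial_sums(1.2, 0.0)` gives a weight sum that levels off (the upper bound is violated).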



In the case $\gamma_0 = 0$, i.e., the observation process is stationary (constant SNR), the following property
of $\{\mathbf{u}(i)\}$ holds:
Proposition 9 Suppose $\gamma_0 = 0$ and assumptions (A.1),(A.2),(A.5) hold. Then, in addition to the consistency
in Proposition 7, we have the following:

(1) Assume $\tau_c = 1$, i.e., the weight sequence $\{\alpha_c(i)\}$ is of the form

$$\alpha_c(i) = \frac{a_c}{i+1} \tag{52}$$

Then, if $a_c > \frac{N}{2 \lambda_{\min}(KG)}$, the normalized sequence $\{\sqrt{i+1}\,(\mathbf{u}(i) - \theta^*)\}$ is asymptotically
normal, i.e.,

$$\sqrt{i+1}\,(\mathbf{u}(i) - \theta^*) \Longrightarrow \mathcal{N}(0, S_c(K)) \tag{53}$$

where the asymptotic variance is given by:

$$S_c(K) = \frac{a_c^2}{N^2} \int_0^\infty e^{\Sigma_1 v}\, S_1\, e^{\Sigma_1^T v}\, dv \tag{54}$$

$$\Sigma_1 = -\frac{a_c}{N} KG + \frac{1}{2} I_M \tag{55}$$

$$S_1 = K (\mathbf{1}_N \otimes I_M)^T \overline{D}_H S_\zeta \overline{D}_H^T (\mathbf{1}_N \otimes I_M) K^T \tag{56}$$

(2) Let the hypothesis of the previous assertion hold and choose $K_c = K^* = G^{-1}$. Then, the estimator
$\{\mathbf{u}(i)\}$ is the best linear centralized estimator in terms of asymptotic variance irrespective of the
distribution of the observation noise $\zeta(i)$. In addition, if the observation noise sequence $\{\zeta(i)\}$ is
Gaussian, $\{\mathbf{u}(i)\}$ as defined above is the optimum centralized estimator, whose asymptotic variance
$S_c(K^*)$ equals the centralized Fisher information rate.

Proof: The proof of the first assertion is omitted due to space limitations (see [1] for similar
arguments.) That $K_c = G^{-1}$ yields the best linear estimator is standard (see, for example, [24].) ■
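As a sanity check on the asymptotic variance in Proposition 9, the scalar case (M = 1) admits a closed form, sketched below. The assumptions are illustrative and not taken from the paper: the noise is independent across sensors with a common variance `noise_var`, and the integral is evaluated as $\int_0^\infty e^{2 \Sigma_1 v}\, dv = 1/(-2\Sigma_1)$, which requires $\Sigma_1 < 0$. The helper name `scalar_asymptotic_variance` is hypothetical.

```python
def scalar_asymptotic_variance(a_c, K, H, noise_var=1.0):
    """Scalar (M = 1) asymptotic variance of sqrt(i+1) * (u(i) - theta*).

    Assumes i.i.d. sensor noise with variance `noise_var` (illustrative model).
    With c = a_c * K / N and G = sum_n H_n^2, the value is
    c^2 * G * noise_var / (2 * c * G - 1), finite only if a_c > N / (2 * K * G).
    """
    N = len(H)
    G = sum(h * h for h in H)
    c = a_c * K / N
    if 2.0 * c * G <= 1.0:
        raise ValueError("need a_c > N / (2 K G) for a finite asymptotic variance")
    return c * c * G * noise_var / (2.0 * c * G - 1.0)
```

Under these assumptions, choosing $a_c = N$ and $K = 1/G$ gives `noise_var / G`, and other gains satisfying the constraint give a larger value, consistent with the second assertion.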

IV. Main Results 

Theorem 10 Consider a fixed $0 \leq \gamma_0 < .5$. Let assumptions (A.1),(A.2),(A.5) hold.

(1) Consider the QCU algorithm with design parameters $(\tau_1, a, \tau_2, b, K)$ satisfying assumption (A.6).
For each sensor $n$, the estimate sequence $\{\mathbf{x}_n(i)\}$ generated by the QCU is a consistent estimator
of $\theta^*$, i.e.,

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} \mathbf{x}_n(i) = \theta^* \right) = 1, \quad \forall n \tag{57}$$

(2) Consider a centralized estimator $\{\mathbf{u}(i)\}$ corresponding to a given choice of $\{\alpha_c\}$ and $K_c$. Choose
$K = K_c$, $\tau_1 = \tau_c$ and $\tau_2$ satisfying $0 < \tau_2 < \tau_1 - \gamma_0 - \frac{1}{2+\varepsilon_1}$, such that assumptions (A.1)-(A.6)
hold (such a choice is always possible by Proposition 7.) Also, if $\tau_1 = 1$, further assume that the
constant $a$ in assumption (A.6) satisfies

$$a > \frac{N \tau_0}{\lambda_{\min}(KG)} \tag{58}$$

For each sensor $n$, consider the estimate sequence $\{\mathbf{x}_n(i)\}$ generated by the corresponding QCU
algorithm with the above design parameters. Then, for every $0 \leq \tau_0 < \tau_1 - \tau_2 - \gamma_0 - \frac{1}{2+\varepsilon_1}$, we have

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} (i+1)^{\tau_0} \left( \mathbf{x}_n(i) - \mathbf{u}(i) \right) = \mathbf{0} \right) = 1, \quad \forall n \tag{59}$$

We discuss the consequences of Theorem 10. The first assertion states that, as long as $0 \leq \gamma_0 < .5$,
any distributed QCU estimator yields consistent parameter estimates at every sensor. By Remark 8, this
is precisely the class of fading parameters that a centralized estimator can estimate consistently. In other
words, as long as a centralized linear estimator can consistently estimate a parameter, a distributed QCU
estimator can. This is interesting, as the range of allowable values of $\gamma_0$ is independent of the network topology,
and any random network satisfying the mean connectivity is sufficient. The second assertion quantifies
the rate at which the distributed QCU estimator converges to the centralized estimator. Again, this rate
is independent of the network topology.
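The statement that each sensor tracks the centralized estimate can be explored with a minimal scalar simulation (M = 1). Everything specific here is an assumption for illustration: the distributed update form (a consensus step over one uniformly chosen gossip pair plus a local innovation step, inferred from the recursions used later in the analysis), stationary unit-variance Gaussian noise ($\gamma_0 = 0$), the gains `a`, `b`, the exponents `tau1`, `tau2`, and the function name `simulate_qcu`.

```python
import random


def simulate_qcu(theta_star, H, steps=20000, a=None, b=0.4,
                 tau1=1.0, tau2=0.2, seed=0):
    """Scalar sketch: gossip consensus + innovations versus a centralized estimate.

    Returns (list of sensor estimates, centralized estimate) after `steps` updates.
    """
    rng = random.Random(seed)
    N = len(H)
    G = sum(h * h for h in H)
    K = 1.0 / G                           # illustrative gain (scalar analogue of G^{-1})
    a = float(N) if a is None else a
    x = [0.0] * N                         # sensor estimates x_n(0)
    u = 0.0                               # centralized estimate u(0)
    for i in range(steps):
        alpha = a / (i + 1) ** tau1       # innovation weight
        beta = b / (i + 1) ** tau2        # consensus weight, decays more slowly
        z = [h * theta_star + rng.gauss(0.0, 1.0) for h in H]
        # Centralized update with the same weights (uses all observations).
        u += (alpha / N) * K * sum(H[n] * (z[n] - H[n] * u) for n in range(N))
        # Gossip: a single uniformly chosen pair exchanges estimates.
        p, q = rng.sample(range(N), 2)
        cons = [0.0] * N
        cons[p] = x[p] - x[q]
        cons[q] = x[q] - x[p]
        x = [x[n] - beta * cons[n] + alpha * K * H[n] * (z[n] - H[n] * x[n])
             for n in range(N)]
    return x, u
```

In a run such as `simulate_qcu(2.0, [1.0, 2.0, 0.0])`, the third sensor observes nothing locally ($H_3 = 0$) and can only learn through gossip, yet its estimate should end up close to both `theta_star` and the centralized value.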

The following result (Theorem 11) shows in what sense the QCU algorithm is optimal. We assume
$\gamma_0 = 0$ in what follows. Suitable extensions to arbitrary $\gamma_0$ may be possible; however, this would impose
added technicalities and digress from the main focus of the paper. Also, the notion of asymptotic variance
as the metric for comparing different consistent estimators is not quite clear for nonstationary recursive
procedures.

Theorem 11 (1) Recall the positive definite matrix $G = \sum_{n=1}^{N} H_n^T H_n$. Assume $\tau_1 = 1$, i.e., the weight
sequence $\{\alpha(i)\}$ is of the form

$$\alpha(i) = \frac{a}{i+1} \tag{60}$$

where $a > \frac{N}{2 \lambda_{\min}(KG)}$ and $K$ is the positive definite matrix gain that commutes with $G$. Choose any
$\tau_2$ satisfying

$$\tau_2 + \frac{1}{2+\varepsilon_1} < .5 \tag{61}$$

and note that such a choice exists as $\frac{1}{2+\varepsilon_1} < .5$. Consider the QCU algorithm with design parameters
$(\tau_1, a, \tau_2, b, K)$ chosen above (this ensures that $(\tau_1, a, \tau_2, b, K)$ satisfy assumption (A.6).) Then, the
normalized estimate sequence $\{\sqrt{i+1}\,(\mathbf{x}_n(i) - \theta^*)\}$ is asymptotically normal for each $n$, i.e.,

$$\sqrt{i+1}\,(\mathbf{x}_n(i) - \theta^*) \Longrightarrow \mathcal{N}(0, S_c(K)) \tag{62}$$

Here, the asymptotic variance $S_c(K)$ is the same as that obtained by a centralized estimator in Proposition 9
with gain $K_c = K$.

(2) Let the hypothesis of the previous assertion hold with the matrix gain $K$ taking the value $K^* = G^{-1}$.
Then, the asymptotic variance at each sensor is $S_c(K^*)$, which is the asymptotic variance achieved
by the best linear centralized estimator (see Proposition 9.) In particular, if the observation noise
process is Gaussian, the QCU estimator constructed above is asymptotically efficient.
We interpret the above. The first assertion implies that given a centralized estimator with matrix gain
$K$ and satisfying the assumptions in Proposition 9, there exists a distributed QCU estimator achieving
the same asymptotic variance $S_c(K)$. This result is remarkable, as the asymptotic variance $S_c(K)$ is
independent of the network topology $L$. This is possible due to the mixed time scale behavior resulting
from appropriate choice of $\tau_1, \tau_2$. This invariance to the network topology is not achievable by the
single time scale scheme ($\tau_1 = \tau_2$) developed in [1]. In a sense, Theorem 11 justifies the applicability
and advantage of distributed estimation schemes. Apart from issues of robustness, implementing a 
centralized estimator is much more communication intensive as it requires transmitting all sensor data to 
a fusion center at all times. On the other hand, the distributed QCU algorithm requires only sparse local 
communication among the sensors at each step, and achieves the performance of a centralized estimator 
asymptotically. The second assertion of the theorem reemphasizes the optimality and applicability of 
distributed estimation schemes, and shows that QCU can be designed to achieve the asymptotic variance 
of the optimal linear centralized scheme. In particular, if the observation noise process is Gaussian, QCU 
leads to asymptotically efficient estimators at each sensor. 

V. QCU: Convergence properties 

As noted earlier, the mixed time scale behavior of QCU does not permit the use of standard stochastic
approximation tools for establishing convergence. Moreover, to be able to establish important qualitative
properties like asymptotic time scale separation, we need to clearly distinguish the long term effects of the
consensus and innovations potential. We briefly outline the key steps involved in such a pursuit. We first
identify conditions under which the sensor estimates $\{\mathbf{x}_n(i)\}$ converge to an averaged estimate $\{\mathbf{x}_{\text{avg}}(i)\}$
over the network and recognize the pathwise (strong) convergence rate. This is carried out in Lemma 15.
The averaged estimator $\{\mathbf{x}_{\text{avg}}(i)\}$ is not quite the centralized estimator $\{\mathbf{u}(i)\}$, the key reason being that the
average of the local innovations is not the centralized innovation. This leads us to study the rate of convergence
of the averaged local innovations to the centralized innovation and hence, the convergence rate of the
averaged estimate sequence to the centralized one. This is accomplished in Lemma 16. The analysis in all
these steps culminates in Theorems 10 and 11, the main results of the paper. These results identify conditions
under which the consistent estimate sequences $\{\mathbf{x}_n(i)\}$ inherit the centralized convergence rate to $\theta^*$. In
particular, they establish sufficient conditions for the equivalence between the distributed and centralized
schemes in terms of asymptotic variance. The methodology developed in this work is of independent
interest and goes beyond the setting of distributed parameter estimation. We envision its applicability in
the analysis of generic dynamical systems interacting over a network.

In what follows, we consider the QCU algorithm with fixed design parameters $(\tau_1, a, \tau_2, b, K)$, and
assumptions (A.1)-(A.6) hold throughout.

We start by establishing pathwise boundedness of the sequence {x(i)}. 
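Lemma 12 There exists a finite random variable $R > 0$ such that

$$\mathbb{P}_{\theta^*} \left( \sup_{i \in \mathbb{T}_+} \|\mathbf{x}(i)\| < R \right) = 1 \tag{63}$$

Proof: Define the process $\{\mathbf{y}(i)\}$ as

$$\mathbf{y}(i) = \mathbf{x}(i) - \mathbf{1}_N \otimes \theta^* \tag{64}$$

The assertion would follow if we establish boundedness for the process $\{\mathbf{y}(i)\}$. From eqn. (10) we note
that $\{\mathbf{y}(i)\}$ satisfies the recursion:

$$\mathbf{y}(i+1) = \left( I_{NM} - \beta(i) \overline{L} \otimes I_M - \alpha(i) (I_N \otimes K) D_H \right) \mathbf{y}(i) - \beta(i) \left( \widetilde{L}(i) \otimes I_M \right) \mathbf{y}(i) + \alpha(i) (I_N \otimes K) \left( \overline{D}_H \mathbf{z}(i) - D_H (\mathbf{1}_N \otimes \theta^*) \right) \tag{65}$$

where we use the invariance of the Laplacian operator,

$$(L \otimes I_M)(\mathbf{1}_N \otimes \theta^*) = \mathbf{0}_{NM}$$

Consider the process $\{V_2(i)\}$ given by

$$V_2(i) = \|\mathbf{y}(i)\|^2 \tag{66}$$

By using the conditional independence properties, it can be shown that

$$\begin{aligned}
\mathbb{E}_{\theta^*} \left[ V_2(i+1) \mid \mathcal{F}_i \right] = {} & V_2(i) + \beta^2(i)\, \mathbf{y}(i)^T\, \mathbb{E}_{\theta^*} \left[ \widetilde{L}^2(i) \otimes I_M \right] \mathbf{y}(i) + \alpha^2(i)\, \mathbb{E}_{\theta^*} \left[ \left\| (I_N \otimes K) \left( \overline{D}_H \mathbf{z}(i) - D_H (\mathbf{1}_N \otimes \theta^*) \right) \right\|^2 \right] \\
& - 2 \mathbf{y}^T(i) \left( \beta(i) \overline{L} \otimes I_M + \alpha(i) (I_N \otimes K) D_H \right) \mathbf{y}(i) + \beta^2(i)\, \mathbf{y}^T(i) (\overline{L} \otimes I_M)^2 \mathbf{y}(i) \\
& + \alpha^2(i)\, \mathbf{y}^T(i) \left( (I_N \otimes K) D_H \right)^T \left( (I_N \otimes K) D_H \right) \mathbf{y}(i) \\
& + 2 \alpha(i) \beta(i)\, \mathbf{y}^T(i) (\overline{L} \otimes I_M) (I_N \otimes K) D_H\, \mathbf{y}(i)
\end{aligned} \tag{67}$$

We use the following inequalities:

$$\mathbf{y}(i)^T\, \mathbb{E}_{\theta^*} \left[ \widetilde{L}^2(i) \otimes I_M \right] \mathbf{y}(i) = \mathbf{y}_{\mathcal{C}^\perp}^T(i)\, \mathbb{E}_{\theta^*} \left[ \widetilde{L}^2(i) \otimes I_M \right] \mathbf{y}_{\mathcal{C}^\perp}(i) \leq c_5 \|\mathbf{y}_{\mathcal{C}^\perp}(i)\|^2 \tag{68}$$

$$\mathbf{y}^T(i) (\overline{L} \otimes I_M)^2 \mathbf{y}(i) \leq \lambda_N^2(\overline{L}) \|\mathbf{y}_{\mathcal{C}^\perp}(i)\|^2 \tag{69}$$

$$\begin{aligned}
2 \mathbf{y}^T(i) \left( \beta(i) \overline{L} \otimes I_M + \alpha(i) (I_N \otimes K) D_H \right) \mathbf{y}(i) &\geq \beta(i)\, \mathbf{y}^T(i) (\overline{L} \otimes I_M) \mathbf{y}(i) + \mathbf{y}^T(i) \left( \beta(i) \overline{L} \otimes I_M + \alpha(i) (I_N \otimes K) D_H \right) \mathbf{y}(i) \\
&\geq \beta(i) \lambda_2(\overline{L}) \|\mathbf{y}_{\mathcal{C}^\perp}(i)\|^2 + c_4\, \alpha(i) \|\mathbf{y}(i)\|^2
\end{aligned} \tag{70}$$

We use Lemma 6 to obtain the last inequality. Introducing additional constants to bound the quadratic
forms and the moments, we derive the following from eqn. (67):

$$\begin{aligned}
\mathbb{E}_{\theta^*} \left[ V_2(i+1) \mid \mathcal{F}_i \right] \leq {} & V_2(i) - \left( \beta(i) \lambda_2(\overline{L}) - \beta^2(i) c_5 - \beta^2(i) \lambda_N^2(\overline{L}) \right) \|\mathbf{y}_{\mathcal{C}^\perp}(i)\|^2 \\
& - \left( c_4 \alpha(i) - \alpha(i) \beta(i) c_7 \right) \|\mathbf{y}(i)\|^2 + \alpha^2(i) \gamma^2(i) c_8 + \alpha^2(i) c_6 \|\mathbf{y}(i)\|^2
\end{aligned} \tag{71}$$

where $c_8 > 0$ is a constant such that

$$\alpha^2(i)\, \mathbb{E}_{\theta^*} \left[ \left\| (I_N \otimes K) \left( \overline{D}_H \mathbf{z}(i) - D_H (\mathbf{1}_N \otimes \theta^*) \right) \right\|^2 \right] = \alpha^2(i) \gamma^2(i) c_8 \tag{72}$$

Since $\beta^2(i)$ goes to zero faster than $\beta(i)$, the $\beta(i)$ term dominates in the second expression of eqn. (71)
eventually. Similarly, the $\alpha(i)$ term dominates the third expression eventually. Choose $c_9 = \max(c_6, c_8)$.
Since $\gamma(i) \geq 1$ (assumption (A.2)), there exists $i_2$ large enough such that, for $i \geq i_2$,

$$\mathbb{E}_{\theta^*} \left[ V_2(i+1) \mid \mathcal{F}_i \right] - V_2(i) \leq \alpha^2(i) \gamma^2(i) c_8 + \alpha^2(i) c_6 V_2(i) \leq c_9\, \alpha^2(i) \gamma^2(i) \left( 1 + V_2(i) \right) \tag{73}$$

Now introduce the process

$$\widetilde{V}_2(i) = \left( 1 + V_2(i) \right) \prod_{k=i}^{\infty} \left( 1 + c_9\, \alpha^2(k) \gamma^2(k) \right) \tag{74}$$

Note that the above is well defined, as the product $\prod_{k=i}^{\infty} \left( 1 + c_9 \alpha^2(k) \gamma^2(k) \right)$ converges for all $i$ due to
the square summability of $\{\alpha(i) \gamma(i)\}$ (assumption (A.6)). Eqn. (73) and some algebraic manipulations
lead to

$$\mathbb{E}_{\theta^*} \left[ \widetilde{V}_2(i+1) \mid \mathcal{F}_i \right] \leq \widetilde{V}_2(i) \tag{75}$$

thus establishing that the sequence $\{\widetilde{V}_2(i)\}$ is a nonnegative supermartingale. Hence, there exists a finite
random variable $R$ such that $\lim_{i \to \infty} \widetilde{V}_2(i) = R$ a.s. We then have from eqn. (74)

$$\lim_{i \to \infty} V_2(i) = R - 1 \quad \text{a.s.} \tag{76}$$

Hence, $\{V_2(i)\}$ is bounded pathwise and the assertion follows. ■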

Remark 13 A deeper investigation of the supermartingale would reveal that V 2 (i) in fact, converges to 
zero. This would have established the consistency of the estimators. However, to obtain strong convergence 
rates, we need to study the sample paths more critically. The rest of this subsection is devoted to this 
study. 

The following lemma identifies the rate at which the estimates converge to a network averaged estimate 
and hence characterizes the information flow in the network. 
Before that, we establish the following: 



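Proposition 14 Let assumptions (A.1)-(A.6) hold.

(1) For all $i \in \mathbb{T}_+$, define

$$\mathbf{J}_1(\mathbf{z}(i)) = (I_N \otimes K) \overline{D}_H \mathbf{z}(i) - \mathbf{1}_N \otimes \left( \frac{1}{N} (\mathbf{1}_N \otimes I_M)^T (I_N \otimes K) \overline{D}_H \mathbf{z}(i) \right) \tag{77}$$

Then, for every $\delta > 0$, we have the following:

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} \frac{1}{(i+1)^{\gamma_0 + \frac{1}{2+\varepsilon_1} + \delta}} \|\mathbf{J}_1(\mathbf{z}(i))\| = 0 \right) = 1 \tag{78}$$

(2) Recall the matrix

$$P_{NM} = \frac{1}{N} (\mathbf{1}_N \otimes I_M) (\mathbf{1}_N \otimes I_M)^T \tag{79}$$

Then, for $i \in \mathbb{T}_+$ sufficiently large, we have

$$\left\| I_{NM} - \beta(i) \left( L(i) \otimes I_M \right) - P_{NM} \right\| = 1 - \beta(i) \lambda_2(L(i)) \tag{80}$$

Proof: For the first assertion, consider any $\varepsilon_2 > 0$. By Chebyshev's inequality and assumption (A.5),

$$\mathbb{P}_{\theta^*} \left( \frac{1}{(i+1)^{\gamma_0 + \frac{1}{2+\varepsilon_1} + \delta}} \|\mathbf{J}_1(\mathbf{z}(i))\| > \varepsilon_2 \right) \leq \frac{\mathbb{E}_{\theta^*} \left[ \|\mathbf{J}_1(\mathbf{z}(i))\|^{2+\varepsilon_1} \right]}{\varepsilon_2^{2+\varepsilon_1} (i+1)^{(2+\varepsilon_1) \left( \gamma_0 + \frac{1}{2+\varepsilon_1} + \delta \right)}} \leq \frac{\kappa(\theta^*)}{\varepsilon_2^{2+\varepsilon_1} (i+1)^{1 + \delta(2+\varepsilon_1)}} \tag{81}$$

Since $\delta > 0$, the sequence $\left\{ (i+1)^{-(1 + \delta(2+\varepsilon_1))} \right\}$ is summable and we obtain

$$\sum_{i \in \mathbb{T}_+} \mathbb{P}_{\theta^*} \left( \frac{1}{(i+1)^{\gamma_0 + \frac{1}{2+\varepsilon_1} + \delta}} \|\mathbf{J}_1(\mathbf{z}(i))\| > \varepsilon_2 \right) < \infty \tag{82}$$

It then follows from the Borel-Cantelli lemma (see [25]) that

$$\mathbb{P}_{\theta^*} \left( \frac{1}{(i+1)^{\gamma_0 + \frac{1}{2+\varepsilon_1} + \delta}} \|\mathbf{J}_1(\mathbf{z}(i))\| > \varepsilon_2 \ \text{i.o.} \right) = 0 \tag{83}$$

where i.o. stands for infinitely often. Since the above holds for $\varepsilon_2 > 0$ arbitrarily small, the claim in
eqn. (78) holds by standard arguments.

For the second assertion, we note from the discussion on Kronecker products in Section I-B that the
eigenvalues of the matrix $\left( I_{NM} - \beta(i) (L(i) \otimes I_M) - P_{NM} \right)$ are $0$ and $1 - \beta(i) \lambda_n(L(i))$, $n = 2, \cdots, N$,
each repeated $M$ times. Since the Laplacian eigenvalues are all bounded above by $N^2$ and $\beta(i) \to 0$,
there exists $i_4 \in \mathbb{T}_+$ sufficiently large such that, for $i \geq i_4$, $\beta(i) \lambda_n(L(i)) \leq 1$ for all $2 \leq n \leq N$. The
assertion is then obvious. ■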



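Lemma 15 Define the averaged estimate sequence $\{\mathbf{x}_{\text{avg}}(i)\}$ as

$$\mathbf{x}_{\text{avg}}(i) = \frac{1}{N} (\mathbf{1}_N \otimes I_M)^T \mathbf{x}(i) \tag{84}$$

Then for every $\tau_0$ such that

$$0 \leq \tau_0 < \tau_1 - \tau_2 - \gamma_0 - \frac{1}{2+\varepsilon_1} \tag{85}$$

we have

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} (i+1)^{\tau_0} \left( \mathbf{x}(i) - \mathbf{1}_N \otimes \mathbf{x}_{\text{avg}}(i) \right) = \mathbf{0} \right) = 1 \tag{86}$$

Proof: Define the process $\{\mathbf{y}(i)\}$:

$$\mathbf{y}(i) = \mathbf{x}(i) - \mathbf{1}_N \otimes \mathbf{x}_{\text{avg}}(i) \tag{87}$$

Recall the matrix

$$P_{NM} = \frac{1}{N} (\mathbf{1}_N \otimes I_M) (\mathbf{1}_N \otimes I_M)^T \tag{88}$$

and note that

$$P_{NM}\, \mathbf{x}(i) = \mathbf{1}_N \otimes \mathbf{x}_{\text{avg}}(i), \qquad P_{NM} \left( \mathbf{1}_N \otimes \mathbf{x}_{\text{avg}}(i) \right) = \mathbf{1}_N \otimes \mathbf{x}_{\text{avg}}(i) \tag{89}$$

From eqn. (10) we then note that $\{\mathbf{y}(i)\}$ satisfies the recursion:

$$\begin{aligned}
\mathbf{y}(i+1) = {} & \left( I_{NM} - \beta(i) L(i) \otimes I_M - P_{NM} \right) \mathbf{y}(i) \\
& - \alpha(i) \left[ (I_N \otimes K) D_H \mathbf{x}(i) - \mathbf{1}_N \otimes \left( \frac{1}{N} (\mathbf{1}_N \otimes I_M)^T (I_N \otimes K) D_H \mathbf{x}(i) \right) \right] + \alpha(i) \left[ \mathbf{J}_1(\mathbf{z}(i)) \right]
\end{aligned} \tag{90}$$

where $\mathbf{J}_1(\mathbf{z}(i))$ is defined in (77). Choose $\delta$ satisfying

$$0 < \delta < \tau_1 - \tau_2 - \gamma_0 - \frac{1}{2+\varepsilon_1} - \tau_0 \tag{91}$$

Then, by Proposition 14, we have

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} \frac{1}{(i+1)^{\gamma_0 + \frac{1}{2+\varepsilon_1} + \delta}} \|\mathbf{J}_1(\mathbf{z}(i))\| = 0 \right) = 1 \tag{92}$$

Also, Lemma 12 implies

$$\mathbb{P}_{\theta^*} \left( \sup_{i \in \mathbb{T}_+} \left\| (I_N \otimes K) D_H \mathbf{x}(i) - \mathbf{1}_N \otimes \left( \frac{1}{N} (\mathbf{1}_N \otimes I_M)^T (I_N \otimes K) D_H \mathbf{x}(i) \right) \right\| < \infty \right) = 1 \tag{93}$$

by the boundedness of $\{\mathbf{x}(i)\}$. However, these pathwise bounds are not uniform over the sample paths
and hence we use truncation arguments. For a scalar $a$, define its truncation $(a)^{R_0}$ at level $R_0 > 0$ by

$$(a)^{R_0} = \begin{cases} \dfrac{a}{|a|} \min(|a|, R_0) & \text{if } a \neq 0 \\ 0 & \text{if } a = 0 \end{cases} \tag{94}$$

For a vector, the truncation operation applies component-wise. For $R_0 > 0$, we also consider the
sequences $\{\mathbf{y}_{R_0}(i)\}_{i \geq 0}$, given by

$$\begin{aligned}
\mathbf{y}_{R_0}(i+1) = {} & \left( I_{NM} - \beta(i) L(i) \otimes I_M - P_{NM} \right) \mathbf{y}_{R_0}(i) \\
& - \alpha(i) \left( \left[ (I_N \otimes K) D_H \mathbf{x}(i) - \mathbf{1}_N \otimes \left( \frac{1}{N} (\mathbf{1}_N \otimes I_M)^T (I_N \otimes K) D_H \mathbf{x}(i) \right) \right] \right)^{R_0} \\
& + \alpha(i) \left( \left[ \mathbf{J}_1(\mathbf{z}(i)) \right] \right)^{R_0 (i+1)^{\gamma_0 + \frac{1}{2+\varepsilon_1} + \delta}}
\end{aligned} \tag{95}$$

We will now show that for every $R_0 > 0$,

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} (i+1)^{\tau_0}\, \mathbf{y}_{R_0}(i) = \mathbf{0} \right) = 1 \tag{96}$$

for $\tau_0$ satisfying the hypothesis (85). That this is sufficient to conclude the assertion

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} (i+1)^{\tau_0}\, \mathbf{y}(i) = \mathbf{0} \right) = 1 \tag{97}$$

is a consequence of the following standard argument. The pathwise boundedness of the various terms
implies that for every $\varepsilon_3 > 0$, there exists $R_{\varepsilon_3} > 0$ such that

$$\mathbb{P}_{\theta^*} \left( \sup_{i \in \mathbb{T}_+} \left\| (I_N \otimes K) D_H \mathbf{x}(i) - \mathbf{1}_N \otimes \left( \frac{1}{N} (\mathbf{1}_N \otimes I_M)^T (I_N \otimes K) D_H \mathbf{x}(i) \right) \right\| < R_{\varepsilon_3} \right) > 1 - \varepsilon_3 \tag{98}$$

$$\mathbb{P}_{\theta^*} \left( \sup_{i \in \mathbb{T}_+} \frac{1}{(i+1)^{\gamma_0 + \frac{1}{2+\varepsilon_1} + \delta}} \|\mathbf{J}_1(\mathbf{z}(i))\| < R_{\varepsilon_3} \right) > 1 - \varepsilon_3 \tag{99}$$

For (98) we use the pathwise boundedness of $\{\mathbf{x}(i)\}$ (Lemma 12), whereas (99) holds because the a.s.
convergence in Proposition 14 implies convergence in probability. Clearly, the process $\{\mathbf{y}(i)\}$ agrees with
the process $\{\mathbf{y}_{R_{\varepsilon_3}}(i)\}$ on the set where both of the above events occur. By standard manipulations, it
then follows that

$$\mathbb{P}_{\theta^*} \left( \sup_{i \in \mathbb{T}_+} \left\| \mathbf{y}(i) - \mathbf{y}_{R_{\varepsilon_3}}(i) \right\| = 0 \right) > 1 - 2 \varepsilon_3 \tag{100}$$

The claim in eqn. (96) would then imply

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} (i+1)^{\tau_0}\, \mathbf{y}(i) = \mathbf{0} \right) > 1 - 2 \varepsilon_3 \tag{101}$$

We could then establish the assertion of the lemma by taking $\varepsilon_3$ to zero.

Hence, in the following we establish the claim in eqn. (96) for every $R_0 > 0$. To this end, consider
the scalar process $\{\widehat{y}_{R_0}(i)\}_{i \in \mathbb{T}_+}$ defined recursively as

$$\widehat{y}_{R_0}(i+1) = \left\| I_{NM} - \beta(i) L(i) \otimes I_M - P_{NM} \right\| \widehat{y}_{R_0}(i) + N M R_0\, \alpha(i) + N M R_0\, \alpha(i) (i+1)^{\gamma_0 + \frac{1}{2+\varepsilon_1} + \delta} \tag{102}$$

with initial condition $\widehat{y}_{R_0}(0) = \|\mathbf{y}_{R_0}(0)\|$. Since

$$\begin{aligned}
\|\mathbf{y}_{R_0}(i+1)\| \leq {} & \left\| I_{NM} - \beta(i) L(i) \otimes I_M - P_{NM} \right\| \|\mathbf{y}_{R_0}(i)\| \\
& + \alpha(i) \left\| \left( \left[ (I_N \otimes K) D_H \mathbf{x}(i) - \mathbf{1}_N \otimes \left( \frac{1}{N} (\mathbf{1}_N \otimes I_M)^T (I_N \otimes K) D_H \mathbf{x}(i) \right) \right] \right)^{R_0} \right\| \\
& + \alpha(i) \left\| \left( \left[ \mathbf{J}_1(\mathbf{z}(i)) \right] \right)^{R_0 (i+1)^{\gamma_0 + \frac{1}{2+\varepsilon_1} + \delta}} \right\|
\end{aligned} \tag{103}$$

it follows that

$$\|\mathbf{y}_{R_0}(i)\| \leq \widehat{y}_{R_0}(i), \quad \forall i \tag{104}$$

By Proposition 14, for $i$ large enough, it can be shown that

$$\left\| I_{NM} - \beta(i) L(i) \otimes I_M - P_{NM} \right\| = 1 - \beta(i) \lambda_2(L(i)) \tag{105}$$

We assume w.l.o.g. that the above holds for all $i$. We then have

$$\begin{aligned}
\widehat{y}_{R_0}(i+1) &\leq \left( 1 - \beta(i) \lambda_2(L(i)) \right) \widehat{y}_{R_0}(i) + N M R_0\, \alpha(i) + N M R_0\, \alpha(i) (i+1)^{\gamma_0 + \frac{1}{2+\varepsilon_1} + \delta} \\
&\leq \left( 1 - \beta(i) \lambda_2(L(i)) \right) \widehat{y}_{R_0}(i) + 2 N M R_0\, \alpha(i) (i+1)^{\gamma_0 + \frac{1}{2+\varepsilon_1} + \delta}
\end{aligned} \tag{106}$$

The above implies

$$\widehat{y}_{R_0}(i+1) \leq \left( 1 - \beta(i) \lambda_2(L(i)) \right) \widehat{y}_{R_0}(i) + \frac{2 N M R_0\, a}{(i+1)^{\tau_1 - \gamma_0 - \frac{1}{2+\varepsilon_1} - \delta}} \tag{107}$$

Using a result from [26], we note that $\lambda_2(\overline{L}) > 0$ implies $\mathbb{E}_{\theta^*} \left[ \lambda_2(L(i)) \right] > 0$ (note that this equivalence
is not a consequence of Jensen's inequality, as the second eigenvalue is a concave function of the graph
Laplacian.) The recursion in eqn. (107) then falls under the purview of Lemmas 4 and 5 (see eqns. (85),(91)),
and we have

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} (i+1)^{\tau_0}\, \widehat{y}_{R_0}(i) = 0 \right) = 1 \tag{108}$$

It then follows from eqn. (104) that

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} (i+1)^{\tau_0}\, \mathbf{y}_{R_0}(i) = \mathbf{0} \right) = 1 \tag{109}$$

The assertion is then immediate. ■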
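The truncation operation used in the proof above (eqn. (94), applied component-wise to vectors) can be sketched as follows. The function names `truncate` and `truncate_vector` are illustrative, not notation from the paper.

```python
def truncate(a, level):
    """Truncate the scalar a at `level` > 0: keep the sign, cap the magnitude."""
    if a == 0:
        return 0.0
    return (a / abs(a)) * min(abs(a), level)


def truncate_vector(v, level):
    """Apply the scalar truncation to every component of the vector v."""
    return [truncate(a, level) for a in v]
```

Values whose magnitude is below the level pass through unchanged, which is why the truncated and untruncated processes agree on the event where the pathwise bounds hold.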
Lemma 15 characterizes the proximity of the sensor estimates $\{\mathbf{x}_n(i)\}$ to the network averaged estimate
$\{\mathbf{x}_{\text{avg}}(i)\}$. To infer the convergence of the sensor estimates to $\theta^*$, it then suffices to study the limiting
properties of $\{\mathbf{x}_{\text{avg}}(i)\}$. This is achieved in two steps. In the following, we consider the class of linear
centralized estimators of the parameter $\theta$, and establish its relation to the network averaged estimator
$\{\mathbf{x}_{\text{avg}}(i)\}$. In particular, we investigate the rate at which $\{\mathbf{x}_{\text{avg}}(i)\}$ converges to the class of centralized
estimators. Properties of the centralized estimators are then used to infer the convergence of $\{\mathbf{x}_{\text{avg}}(i)\}$
(and hence, that of $\{\mathbf{x}_n(i)\}$) to $\theta^*$.

The following result is the first step towards characterizing the convergence rate of the network averaged
estimator $\{\mathbf{x}_{\text{avg}}(i)\}$ to $\theta^*$. It establishes the relation between $\{\mathbf{x}_{\text{avg}}(i)\}$ and the class of centralized
estimators $\{\mathbf{u}(i)\}$ introduced in Definition 2.

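Lemma 16 Let $\{\mathbf{u}(i)\}$ be the centralized estimate sequence defined in Definition 2 with $\tau_c = \tau_1$, $a_c = a$ and
$K_c = K$. Then,

(1)

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} \left\| \mathbf{x}_{\text{avg}}(i) - \mathbf{u}(i) \right\| = 0 \right) = 1 \tag{110}$$

(2) Let $\tau_0$ satisfy the assumption

$$0 \leq \tau_0 < \tau_1 - \tau_2 - \gamma_0 - \frac{1}{2+\varepsilon_1} \tag{111}$$

Also, if $\tau_1 = 1$, assume that the constant $a$ in assumption (A.6) satisfies

$$a > \frac{N \tau_0}{\lambda_{\min}(KG)} \tag{112}$$

Then,

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} (i+1)^{\tau_0} \left( \mathbf{x}_{\text{avg}}(i) - \mathbf{u}(i) \right) = \mathbf{0} \right) = 1 \tag{113}$$

Proof: We note that the averaged update may be written as

$$\begin{aligned}
\mathbf{x}_{\text{avg}}(i+1) &= \mathbf{x}_{\text{avg}}(i) + \frac{\alpha(i)}{N} K \sum_{n=1}^{N} H_n^T \mathbf{z}_n(i) - \frac{\alpha(i)}{N} K \sum_{n=1}^{N} H_n^T H_n \mathbf{x}_n(i) \\
&= \mathbf{x}_{\text{avg}}(i) + \frac{\alpha(i)}{N} K \sum_{n=1}^{N} \left( H_n^T \mathbf{z}_n(i) - H_n^T H_n \mathbf{x}_{\text{avg}}(i) \right) - \frac{\alpha(i)}{N} K \sum_{n=1}^{N} H_n^T H_n \left( \mathbf{x}_n(i) - \mathbf{x}_{\text{avg}}(i) \right)
\end{aligned} \tag{114}$$

Define the process $\{\widetilde{\mathbf{u}}(i)\}$ by

$$\widetilde{\mathbf{u}}(i) = \mathbf{x}_{\text{avg}}(i) - \mathbf{u}(i) \tag{115}$$

We then have

$$\widetilde{\mathbf{u}}(i+1) = \left( I_M - \frac{\alpha(i)}{N} K G \right) \widetilde{\mathbf{u}}(i) - \frac{\alpha(i)}{N} K \sum_{n=1}^{N} H_n^T H_n \left( \mathbf{x}_n(i) - \mathbf{x}_{\text{avg}}(i) \right) \tag{116}$$

Now choose $\delta$ such that

$$0 < \delta < \tau_1 - \tau_2 - \gamma_0 - \frac{1}{2+\varepsilon_1} - \tau_0 \tag{117}$$

Since $\tau_0 + \delta < \tau_1 - \tau_2 - \gamma_0 - \frac{1}{2+\varepsilon_1}$, by Lemma 15, it follows that

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} (i+1)^{\tau_0 + \delta} \left\| \sum_{n=1}^{N} H_n^T H_n \left( \mathbf{x}_n(i) - \mathbf{x}_{\text{avg}}(i) \right) \right\| = 0 \right) = 1 \tag{118}$$

Then, there exists a finite random variable $R_3$ such that

$$\mathbb{P}_{\theta^*} \left( \left\| \sum_{n=1}^{N} H_n^T H_n \left( \mathbf{x}_n(i) - \mathbf{x}_{\text{avg}}(i) \right) \right\| \leq R_3\, (i+1)^{-\tau_0 - \delta}, \ \forall i \in \mathbb{T}_+ \right) = 1 \tag{119}$$

Note, by hypothesis, the matrix $KG$ is symmetric and $\alpha(i) \to 0$. Hence, there exists a constant $c_{10} > 0$
such that, for sufficiently large $i$,

$$\left\| I_M - \frac{\alpha(i)}{N} K G \right\| \leq 1 - c_{10}\, \alpha(i) \tag{120}$$

Writing $\omega$-wise and introducing another constant $c_{11} > 0$, we have

$$\|\widetilde{\mathbf{u}}(i+1, \omega)\| \leq \left( 1 - c_{10}\, \alpha(i) \right) \|\widetilde{\mathbf{u}}(i, \omega)\| + c_{11}\, \alpha(i) R_3(\omega) (i+1)^{-\tau_0 - \delta}$$

for $i$ greater than some sufficiently large $i_4(\omega)$. We then have

$$\|\widetilde{\mathbf{u}}(i+1, \omega)\| \leq \left( 1 - c_{10}\, \alpha(i) \right) \|\widetilde{\mathbf{u}}(i, \omega)\| + c_{11}\, a\, R_3(\omega) (i+1)^{-\tau_1 - \tau_0 - \delta} \tag{121}$$

A pathwise (fixed $\omega$) application of Lemma 4 and Lemma 5, noting that the above holds for $\omega$ in a
set of full measure, yields the assertions. ■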

VI. Proofs of main results 

A. Proof of Theorem 10 

Consider the first assertion. Since the QCU parameters $(\tau_1, a, \tau_2, b, K)$ satisfy assumption (A.6), we
note

$$.5 + \gamma_0 < \tau_1 \leq 1 \tag{122}$$

Choose $\tau_c = \tau_1$ and $K_c = K$. It then follows that the centralized estimator $\{\mathbf{u}(i)\}$ (Definition 2) with
these design parameters is good. Hence, by Proposition 7 it is consistent, i.e.,

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} \mathbf{u}(i) = \theta^* \right) = 1 \tag{123}$$

Taking $\tau_0 = 0$ in Lemma 15, we have

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} \left( \mathbf{x}(i) - \mathbf{1}_N \otimes \mathbf{x}_{\text{avg}}(i) \right) = \mathbf{0} \right) = 1 \tag{124}$$

The first assertion of Theorem 10 is then an immediate consequence of (123)-(124) and Lemma 16 (first
assertion.)

The second assertion of Theorem 10 is a direct consequence of Lemma 15 and Lemma 16 (second
assertion.)
B. Proof of Theorem 11

By hypothesis of Theorem 11, we have

$$\tau_1 = 1, \qquad \frac{1}{2+\varepsilon_1} + \tau_2 < .5 \tag{125}$$

Hence, $\tau_1 - \tau_2 - \frac{1}{2+\varepsilon_1} > .5$. Since $a > \frac{N}{2 \lambda_{\min}(KG)}$, there exists $\varepsilon_5 > 0$, small enough, such that

$$a > \frac{N (.5 + \varepsilon_5)}{\lambda_{\min}(KG)} \tag{126}$$

By the above, we can always choose $\tau_0$ satisfying the condition:

$$.5 < \tau_0 < \min \left( .5 + \varepsilon_5,\ \tau_1 - \tau_2 - \frac{1}{2+\varepsilon_1} \right) \tag{127}$$

For such $\tau_0$, we clearly have $a > \frac{N \tau_0}{\lambda_{\min}(KG)}$, and hence by Theorem 10 (second assertion), we conclude

$$\mathbb{P}_{\theta^*} \left( \lim_{i \to \infty} (i+1)^{\tau_0} \left( \mathbf{x}_n(i) - \mathbf{u}(i) \right) = \mathbf{0} \right) = 1 \tag{128}$$

where $\{\mathbf{u}(i)\}$ is the centralized estimator with design parameters $(\tau_c, a_c, K_c)$ such that $a_c = a$, $\tau_c = \tau_1$,
$K_c = K$. It then follows by Proposition 9 that

$$\sqrt{i+1}\,(\mathbf{u}(i) - \theta^*) \Longrightarrow \mathcal{N}(0, S_c(K)) \tag{129}$$

Since $\tau_0$ in (128) is strictly greater than .5, the sequences $\{\mathbf{x}_n(i)\}$ and $\{\mathbf{u}(i)\}$ are indistinguishable in
the $\sqrt{i+1}$ scale, and it can be shown using standard properties of stochastic convergence that

$$\sqrt{i+1}\,(\mathbf{x}_n(i) - \theta^*) \Longrightarrow \mathcal{N}(0, S_c(K)) \tag{130}$$

The second assertion follows by choosing $K = K^*$ in the first.

VII. Conclusions 

The paper considers gossip linear estimation of an unknown large dimensional parameter (or large 
scale static random field) observed by a sparsely interconnected network of sensors operating under 
the gossip communication protocol. We consider this problem under very general conditions on the 
noise assumptions and communication failures (including, link or channel failures, besides the usual 
measurement noise assumptions.) Due to the large scale of the field, the sensors are local, i.e., they 
observe only a small fraction of the field. To obtain a global estimate, the sensors need to cooperate. 
The class of gossip distributed linear estimators we study combines two terms: a consensus term that 
updates at each sensor its current estimate with the state estimates provided by the neighbor(s) when 
they gossip; and an innovations or sensing term that updates the current sensor estimate with the new 
observation. The linear gossip distributed estimators that we analyze exhibit mixed time scales: one
associated with the consensus and the other with the innovations. This forces us to develop new
analytical tools to establish their asymptotic properties, because in gossip distributed estimation
the innovation term is not a martingale difference process, unlike in previous work on mixed time scale
stochastic approximation algorithms, e.g., [4]. A key step in our analysis is therefore to derive pathwise strong
approximation results that characterize the rate at which the innovation process converges to a martingale
difference process. The paper establishes a distributed observability condition: global observability, a
condition on the sensing devices, i.e., the local measurements, plus mean connectedness, a structural 
condition on the communication network as provided by gossip. We show that under this condition 
the distributed estimators' performance approaches the asymptotic performance of the optimal centralized
estimators, namely, the distributed estimators are consistent and asymptotically normal. This is significant, 
as it shows that, under reasonable assumptions, a distributed gossip estimator is as good as a centralized 
one, the latter having access to all sensor observations at all times. As mentioned, the distributed gossip 
estimator has two time scales, which involves setting two gain sequences, one for the local innovations 
at each sensor and the other for estimate fusion (consensus) across sensors. To design good distributed 
gossip estimators, these gains should be chosen properly, namely, the consensus gain should decay at a 
slower rate than the innovation gain. In the absence of quantization or channel noise, the paper shows that 
it is possible to choose the consensus weight sequence such that the sum of its squares diverges to ∞, in contrast
to the innovation weight sequence, whose sum of squares must be finite. This tuning of the different gain
sequences leads to an asymptotic time scale separation, the rate of information dissemination dominating 
the rate of reduction of uncertainty by observation acquisition. This is not possible with quantized or 
noisy transmissions, as each consensus step introduces noise, preventing proper adjustment of the gain 
sequences. The paper interprets the fundamental convergence results on distributed gossip estimation 
in two interesting contexts: 1) when the observations are (conditionally) independent, the distributed 
estimator achieves the same performance (in terms of asymptotic variance) as the best centralized linear 
estimator; and 2) the maximum rate at which the observation noise power (variance) can increase with
time while the estimators remain consistent is the same for the centralized and the gossip linear
distributed estimators.
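
The interplay between the consensus term and the innovations term summarized above can be illustrated with a small simulation. The following is a minimal sketch under several assumptions that are not taken from the paper: a ring of sensors in which sensor n observes only coordinate n of the parameter in Gaussian noise, an identity innovation gain matrix, one randomly chosen active link per step, and illustrative gain constants (`a`, `b`, `tau2 = 0.3`). The function name `gossip_estimate` is hypothetical. The innovation gain decays as 1/(i+1) while the consensus gain decays more slowly, as the conclusions recommend.

```python
import random


def gossip_estimate(theta, steps=20000, a=None, b=0.4, tau1=1.0, tau2=0.3,
                    noise_std=1.0, seed=0):
    """Toy consensus + innovations estimator on a ring of sensors.

    Sensor n observes only coordinate n of theta (plus Gaussian noise), so no
    single sensor can identify theta, but the network as a whole can.
    Returns x, where x[n] is sensor n's final estimate of the full vector.
    """
    rng = random.Random(seed)
    n_sensors = len(theta)
    if a is None:
        # Illustrative choice: the network-average gain becomes 1 / (i + 1).
        a = float(n_sensors)
    # Ring topology: sensor n can gossip with sensor (n + 1) mod N.
    edges = [(n, (n + 1) % n_sensors) for n in range(n_sensors)]
    x = [[0.0] * n_sensors for _ in range(n_sensors)]

    for i in range(steps):
        alpha = a / (i + 1) ** tau1  # innovation gain (decays faster)
        beta = b / (i + 1) ** tau2   # consensus gain (decays slower)
        new_x = [row[:] for row in x]

        # Consensus term: the two sensors on the active link move toward
        # each other, using their estimates from the start of the step.
        n, l = rng.choice(edges)
        for m in range(n_sensors):
            diff = x[n][m] - x[l][m]
            new_x[n][m] -= beta * diff
            new_x[l][m] += beta * diff

        # Innovations term: each sensor corrects the one coordinate it sees.
        for s in range(n_sensors):
            z = theta[s] + rng.gauss(0.0, noise_std)
            new_x[s][s] += alpha * (z - x[s][s])

        x = new_x
    return x


if __name__ == "__main__":
    true_theta = [1.0, -2.0, 0.5]
    for sensor, estimate in enumerate(gossip_estimate(true_theta)):
        print(sensor, [round(v, 3) for v in estimate])
```

In this toy setting every sensor ends up with an estimate of all three coordinates, although it measures only one of them; the unobserved coordinates reach it solely through the gossip exchanges. Setting `tau2` close to or above `tau1` weakens the time scale separation and is a simple way to see why the consensus gain should decay more slowly.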